Referencing:
7 Series FPGAs Overview
7 Series FPGAs Memory Resources
Register File
Starting with the first constraint of the design: how to lay out the register file for the SIMD machine. The target is the Artix-7 XC7A200T, which has capacity for 365 36-Kbit block RAMs (32 Kbits data, 4 Kbits parity each). Block RAMs are symmetrical dual-port (each port can read or write). Each port takes a 16-bit bit-indexed address and returns a 36-bit (32 bits data, 4 bits parity) bit-addressed sliding window into the memory.
The ALU design requires 2 read ports and 1 write port. It would be possible to get 2 read ports by duplicating writes and splitting the memory into 2 copies, but I'd like to be able to use all of the memory. Since the majority of operations are going to stream through consecutive addresses in the register file, the sliding window can be leveraged to fetch 2 consecutive bits per read, pipelined so that even clocks pull 2 bits for the first read operand and odd clocks pull 2 bits for the second read operand. So, ignoring parity, one possible configuration is to set the write port to write only 16 bits (16 lanes), and have the read port fetch 32 bits:
BIT  ADDRESS  SIMD LANE
---  -------  ---------
 0   n        (lane & 15) = 0
 1   n        (lane & 15) = 1
 2   n        (lane & 15) = 2
...
15   n        (lane & 15) = 15
16   n+1      (lane & 15) = 0
17   n+1      (lane & 15) = 1
18   n+1      (lane & 15) = 2
...
31   n+1      (lane & 15) = 15
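The interleaved layout above can be sketched in software. A minimal Python model (the names are mine, not from any HDL) packs 16 lanes per 16-bit write word and shows that a 32-bit window read at bit address 16*n returns register-file bits n and n+1 for every lane:

```python
# Minimal model of the 16-lane bit-interleaved register file layout.
# Bit (16*addr + lane) of the block RAM holds register-file bit `addr`
# of SIMD lane `lane`.

LANES = 16
DEPTH = 2048  # 2 Kbits of register file per lane

def write16(mem, addr, lane_bits):
    """Write port: store one bit per lane at register-file address `addr`."""
    for lane in range(LANES):
        mem[addr * LANES + lane] = lane_bits[lane]

def read32(mem, addr):
    """Read port: 32-bit sliding window starting at bit address 16*addr.
    Returns, per lane, bits (addr, addr+1) -- 2 consecutive bits per read."""
    window = mem[addr * LANES : addr * LANES + 2 * LANES]
    return [(window[lane], window[LANES + lane]) for lane in range(LANES)]

mem = [0] * (DEPTH * LANES)
write16(mem, 5, [lane & 1 for lane in range(LANES)])  # odd lanes get a 1
write16(mem, 6, [1] * LANES)                          # all lanes get a 1
assert read32(mem, 5)[3] == (1, 1)  # lane 3: bit 5 = 1, bit 6 = 1
assert read32(mem, 5)[2] == (0, 1)  # lane 2: bit 5 = 0, bit 6 = 1
```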
In this configuration each 36-Kbit block RAM supports 16 SIMD lanes, each with 2 Kbits of register file. Throughput of the register file provides an upper bound on performance. For a given clock rate, and assuming 256 block RAMs (70% of the 365) are used for the register file, the limiting throughput is:
1-bit ops/second = clockRate * 256 block RAMs * 16 lanes
A 128 MHz machine would thus be limited to:
128 M * 256 blocks * 16 lanes = around 512 Gop/sec
Where each op provides 1 bit of a variable-precision ALU operation. My target for video output is the NES resolution: 256x224 at around 59.94 Hz (non-interlaced), NTSC monochrome only (works on any classic TV). That target is roughly 3.4 Mpix/second; dividing out gives around 150 Kop/pixel. Even if this rough estimate is way too optimistic (which it is), this neo-vintage arcade machine is going to have a tremendous amount of perf/pixel.
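The back-of-envelope arithmetic above can be checked directly; the clock rate, block count, and pixel-rate figures are the ones from the text:

```python
# Throughput bound of the register file and the resulting per-pixel budget.
clock_hz = 128e6
blocks, lanes_per_block = 256, 16

ops_per_sec = clock_hz * blocks * lanes_per_block  # 1-bit ops/second
pix_per_sec = 256 * 224 * 59.94                    # NES-resolution pixel rate
ops_per_pixel = ops_per_sec / pix_per_sec

print(f"{ops_per_sec / 1e9:.0f} Gop/sec")      # ~524 Gop/sec (~512 Gi-op/sec)
print(f"{pix_per_sec / 1e6:.2f} Mpix/sec")     # ~3.44 Mpix/sec
print(f"{ops_per_pixel / 1e3:.0f} Kop/pixel")  # roughly 150 Kop/pixel
```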
Continuing: 256 blocks * 16 lanes = a 4096-lane SIMD machine, but the XC7A200T only has 33,650 slices. Using 70% again: 33,650 slices * 0.7 / 4096 lanes = 5.7 slices/lane. Yes, way too optimistic. Slices per lane for the 1-bit/clock ALU will limit the peak number of lanes in the machine, which will reduce the limiting throughput, but on the positive side will result in more memory per lane. Can work backwards from the possible block RAM configurations to get slice/ALU targets, this time rounding down to only 16K slices:
256 blocks * 16 lanes,  2-Kbit/lane,  4 slices/lane
256 blocks *  8 lanes,  4-Kbit/lane,  8 slices/lane
256 blocks *  4 lanes,  8-Kbit/lane, 16 slices/lane
256 blocks *  2 lanes, 16-Kbit/lane, 32 slices/lane
256 blocks *  1 lane,  32-Kbit/lane, 64 slices/lane
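Each row of the table keeps the slice budget fixed at 16K slices and the data capacity fixed at 256 blocks * 32 Kbits: halving the lanes per block doubles both the register file per lane and the slice budget per ALU. A quick sketch that regenerates the table:

```python
# Regenerate the lane-count / memory / slice-budget trade-off table.
BLOCKS = 256
SLICE_BUDGET = 16 * 1024  # the 70% figure rounded down to 16K slices
DATA_KBIT = 32            # data Kbits per 36-Kbit block RAM (parity ignored)

for lanes in (16, 8, 4, 2, 1):
    kbit_per_lane = DATA_KBIT // lanes
    slices_per_lane = SLICE_BUDGET // (BLOCKS * lanes)
    print(f"{BLOCKS} blocks * {lanes:2d} lanes, "
          f"{kbit_per_lane}-Kbit/lane, {slices_per_lane} slices/lane")
```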
More Constraints
Block RAMs only have byte, not bit, write enables, so if it is possible to reach the 2-lane-or-better configuration, this rules out building any kind of per-lane predication which depends on disabling register file writes per lane. Going to work through the ALU design without needing a per-lane write enable. Likewise, the original thoughts on the NPU have been transformed quite a lot based on real constraints. Focusing on a fused ALU+NPU, where ALU operands can come from other edges in the network graph.
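The write-enable limitation is easy to see with the 16-lane layout from the first table. A small illustration (my own, not from any datasheet): each byte enable gates 8 write bits, and each write bit belongs to a different lane, so one enable always gates 8 lanes together:

```python
# 16-lane configuration: write bit i of the 16-bit word is lane i's bit.
# 7 series block RAM write enables are per byte (8 bits), so asserting
# one byte enable gates 8 lanes at once.
def lanes_gated_by(byte_enable: int) -> list[int]:
    """Lanes affected when only byte enable 0 or 1 is asserted."""
    return list(range(8 * byte_enable, 8 * byte_enable + 8))

assert lanes_gated_by(0) == [0, 1, 2, 3, 4, 5, 6, 7]
assert lanes_gated_by(1) == [8, 9, 10, 11, 12, 13, 14, 15]
# No byte enable pattern masks exactly one lane, so per-lane predication
# via write disable is out; predication has to live in the ALU instead.
```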
More next time ...