Wednesday, April 29, 2020

Inferring LUT and Block RAM

As part of my testing of the latch-based i4004 emulation, I constructed an emulation of a complete, if minimal, MCS-4 system consisting of one i4004 CPU, one i4001 ROM, and one i4002 RAM. My intent was to run the short test program that loads by default into the i400x analyzer. But I never got that far.

Why not? Because I noticed that the resource requirements for my latch-based i4002 RAM were off the charts: 550 slice registers! What happened?

It turns out that LUT RAM requires a clock edge to perform a write. This is noted in the documentation, but I overlooked it. When I removed the posedge clause from the always block, the toolchain recognized that it couldn't use a LUT RAM for the array and switched to slice registers, configured as latches. There are only four latch-capable registers in a slice, so the slice requirements jump massively.

Here's the Verilog code, showing the one line that determines whether a write clock is inferred:
    reg  [3:0]  ram_array [0:(RAM_ARRAY_SIZE-1)];
`ifdef CLOCKED_LUTRAM
    always @(posedge sysclk) begin
`else
    always @(*) begin
`endif
        if (write) begin
            ram_array[addr] <= data_in;
        end
    end
    assign data_out  = ram_array[addr];
    assign data2_out = ram_array[addr2];
All the other logic in the i4002 emulation uses latched registers rather than clocked flip-flops. The two assign statements at the end of the code block show that the array supports dual-port read access: both reads are asynchronous and use independent addresses. Here's a comparison of the resource requirements:

Resource                Latched        Clocked
Occupied slices         204 (14.3%)    34 (2.4%)
Slice Registers         550 (4.8%)     38 (0.3%)
Slice LUTs as Logic     357 (6.2%)     20 (0.3%)
Slice LUTs as Memory    0 (0.0%)       16 (1.1%)

That's a pretty huge difference. Clearly I need to pass the system clock down through the i4002 module to the LUT RAM arrays that implement the emulated i4002 registers.
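
As a rough sketch of what passing the clock down might look like, the array can be wrapped in a small module that takes sysclk as a port. The module name lutram_4bit, the ADDR_WIDTH parameter, and the port list are assumptions made for illustration and are not taken from the actual i4002 emulation. The pattern itself (synchronous write, asynchronous read) is the one the toolchain mapped to LUT RAM in the clocked column above.

```verilog
// Illustrative only: module name, parameters, and port list are hypothetical.
module lutram_4bit #(
    parameter RAM_ARRAY_SIZE = 16,
    parameter ADDR_WIDTH     = 4
) (
    input  wire                  sysclk,    // system clock passed down from the parent
    input  wire                  write,     // write enable, sampled on the clock edge
    input  wire [ADDR_WIDTH-1:0] addr,      // write address and first read address
    input  wire [ADDR_WIDTH-1:0] addr2,     // second (read-only) address
    input  wire [3:0]            data_in,
    output wire [3:0]            data_out,
    output wire [3:0]            data2_out
);
    reg [3:0] ram_array [0:(RAM_ARRAY_SIZE-1)];

    // Synchronous write: the clock edge is what allows LUT RAM inference.
    always @(posedge sysclk) begin
        if (write) begin
            ram_array[addr] <= data_in;
        end
    end

    // Asynchronous reads on two independent addresses.
    assign data_out  = ram_array[addr];
    assign data2_out = ram_array[addr2];
endmodule
```

The surrounding latch-based logic would then only need to hold write, addr, and data_in stable across one rising edge of sysclk for the write to take effect.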



What about the i4001 ROM? My intent has always been to put the ROM array into a Block RAM. The standard Busicom calculator had four i4001 ROMs, providing 1024 bytes of code storage. However, the version with the optional square root support required a fifth ROM, for a total of 1280 bytes. The easiest way for me to support that is to use one of the Spartan-6's 18K-bit Block RAMs in an 8-bit by 2048 configuration. I haven't decided exactly how to have five i4001 instantiations share a single BRAM, but I have a pretty firm idea.
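
To make the arithmetic concrete: five 256-byte ROMs occupy addresses 0 through 1279 of a 2048-entry array, so an 11-bit index is enough. One possible mapping, offered purely as an assumption and not as the scheme the post has in mind, concatenates the low three bits of the 4-bit ROM chip number with the 8-bit address within the chip. The module name rom_addr_map and all of its ports are hypothetical.

```verilog
// Hypothetical address mapper: the names and the mapping are assumptions.
module rom_addr_map (
    input  wire [3:0]  chip_num,   // ROM chip number from the upper 4 address bits
    input  wire [7:0]  byte_addr,  // address within one 256-byte i4001
    output wire [10:0] rom_addr,   // index into a 2048 x 8 array
    output wire        in_range    // high when chip_num selects one of the five ROMs
);
    // Chip n occupies entries n*256 through n*256 + 255.
    assign rom_addr = {chip_num[2:0], byte_addr};

    // Chips 5 and above have no backing storage in this sketch.
    assign in_range = (chip_num < 4'd5);
endmodule
```

With this mapping the last byte of the fifth ROM (chip 4, address 255) lands at index 1279, leaving entries 1280 through 2047 unused.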

For the purposes of this test, though, I simply hacked my 256-byte i4001 emulation to expand the ROM depth to 2048 bytes. Again, here's the Verilog code, showing the one line that determines whether a read clock is inferred:
    reg  [7:0]  rom_array [0:2047];
    reg  [7:0]  rom_temp;
`ifdef CLOCKED_BRAM
    always @(posedge sysclk) begin
`else
    always @(*) begin
`endif
        rom_temp <= rom_array[rom_addr];
    end
The comparison is a bit trickier than with the i4002 RAM, because the toolchain knows the contents of the ROM and optimizes the non-BRAM version a bit. For example, the last 768 bytes of the array are known to be zero. Even so, here's a comparison of the resource requirements using the Busicom software, including the square root code:

Resource                Latched        Clocked
Occupied slices         100 (7.0%)     14 (1.0%)
Slice Registers         18 (0.2%)      18 (0.2%)
Slice LUTs as Logic     226 (4.0%)     5 (0.1%)
16K-bit Block RAMs      0 (0.0%)       1 (3.1%)

This isn't as bad as I'd feared, but it's still far more wasteful than using a BRAM. The ISE Technology Schematic view shows the latched version as having a large array of LUTs translating the addresses into data. Icky. I think I'll stick with the Block RAM implementation, even if it requires a read clock.
