Monday, January 20, 2020

Treating CLK1/2 signals as global latch gate signals

If I were to change the Verilog i4004 emulation to treat the MCS-4 system CLK1 and CLK2 signals as latch gates and not as clock enables, it seemed it might make sense to treat them as I would global clock signals.

My first step in that direction was to create Xilinx timing constraints describing the timing of these signals and their phase relationship with each other and the 20 MHz system clock. I started by using the ISE Timing Constraints Editor, but then discovered it won't let you define related timespecs with duty cycles other than 50%. The CLK1 and CLK2 signals are high for 400ns of each 1350ns cycle, or just under 30%. In a modern synchronously-clocked system where everything is clocked on rising edges this might not matter, but in a latched system it might be important to note that CLK1 and CLK2 do not overlap at any time, even with propagation delays.

Here's the resulting User Constraints File (ucf):
#Created by Constraints Editor (xc6slx9-tqg144-2) - 2020/01/17
#20 MHz external oscillator input
NET "clk" TNM_NET = clk;
TIMESPEC TS_clk = PERIOD "clk" 50 ns HIGH 50%;
#Internally-generated MCS-4 CLK1
NET "clk1" TNM_NET = clk1;
TIMESPEC TS_clk1 = PERIOD "clk1" TS_clk * 27 PHASE 5 ns HIGH 400 ns;
#Internally-generated MCS-4 CLK2
NET "clk2" TNM_NET = clk2;
TIMESPEC TS_clk2 = PERIOD "clk2" TS_clk1 PHASE 800 ns HIGH 400 ns;
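
To make the numbers in these constraints concrete, here is a minimal sketch of a divider that would produce waveforms matching them, assuming CLK1 and CLK2 are derived from a 27-state counter running on the 20 MHz clock. The module name, port names, and counter scheme are illustrative assumptions, not the actual generator in the emulation.

```verilog
// Sketch only: clk_gen_sketch and its counter scheme are hypothetical.
// 27 states x 50 ns = 1350 ns per MCS-4 clock cycle.
module clk_gen_sketch (
    input  wire clk,            // 20 MHz system clock (50 ns period)
    output reg  clk1 = 1'b0,    // high for 8 states = 400 ns
    output reg  clk2 = 1'b0     // high for 8 states, starting 800 ns after clk1 rises
);
    reg [4:0] count = 5'd0;

    always @(posedge clk) begin
        // Count 0..26, then wrap
        count <= (count == 5'd26) ? 5'd0 : count + 5'd1;
        // States 0..7: CLK1 high (400 ns)
        clk1  <= (count < 5'd8);
        // States 16..23: CLK2 high (400 ns), 16 x 50 ns = 800 ns after CLK1
        clk2  <= (count >= 5'd16) && (count < 5'd24);
    end
endmodule
```

With this scheme there are 400 ns between the fall of CLK1 and the rise of CLK2, and 150 ns between the fall of CLK2 and the next rise of CLK1, so the two gates never overlap. The outputs are registered, so each edge trails the system clock edge by a clock-to-out delay, which is the kind of offset the 5 ns PHASE value on TS_clk1 would account for.
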
Next I thought about how to distribute these global signals. A Spartan-6 FPGA has 16 "high-speed, low-skew global clock resources" used to distribute clock signals throughout the chip. Clock signals are placed on these through the use of a BUFGMUX or BUFG element and are intended to drive the Clock input to flip-flops. In the Spartan-6, transparent latches are implemented in the same circuitry as clocked flip-flops, so it made sense to me to put the CLK1 and CLK2 signals on these global clock distribution resources.
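
As a rough illustration of what placing a net on a global buffer looks like, here is a fragment that instantiates the Xilinx BUFG primitive with clk2 as its output. The instance name matches the one that appears in the tool messages; clk2_unbuf is a hypothetical name for the unbuffered source net, and the fragment is meant to sit inside the module that generates the clock.

```verilog
// Fragment, not a complete module.
// clk2_unbuf is a hypothetical name for the net produced by the clock generator.
wire clk2_unbuf;
wire clk2;          // buffered net, routed on a global clock line

BUFG BUFG_clk2 (
    .I (clk2_unbuf),    // unbuffered input
    .O (clk2)           // global clock output
);
```
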

To do this I forced the CLK1 and CLK2 nets to be global clocks by instantiating BUFG elements with these nets as outputs. When I rebuilt the code I received this error:
ERROR:Place:1136 - This design contains a global buffer instance <BUFG_clk2>, driving the net, <clk2>, that is driving the following (first 30) non-clock load pins.
   < PIN: control/clk2_x22_AND_8_o11.A4; >
   < PIN: control/clk2_src_AND_13_o1_SW0.A5; >
   This is not a recommended design practice in Spartan-6 due to limitations in the global routing that may cause excessive delay, skew or unroutable situations. It is recommended to only use a BUFG resource to drive clock loads. If you wish to override this recommendation, you may use the CLOCK_DEDICATED_ROUTE constraint (given below) in the .ucf file to demote this message to a WARNING and allow your design to continue.
   < PIN "BUFG_clk2.O" CLOCK_DEDICATED_ROUTE = FALSE; >
Where did this come from? Looking at the Technology Schematic generated by ISE, I examined the sources of the named nets and identified this section of my i4002 emulation:
    reg     src;
    always @(*) begin
        if (clk2 & x22) begin
            src <= cm & (data[3:2] == CHIP_NUMBER);
            if (src) reg_num <= data[1:0];
        end
        if (clk2 & x32) begin
            if (src) char_num <= data;
        end
    end
The purpose of this code is to recognize when the i4004 is selecting a ROM or RAM chip and specifying an address within that chip during the X2 and X3 phases of an SRC instruction. When this happens, the i4002 is required to identify that it is being selected by the CM input and bits [3:2] of the data bus during subcycle X2. The i4002 thus selected is required to latch the register number from bits [1:0] of the data bus during X2, and the character number from bits [3:0] of the data bus during X3.

[Note: I realized later that this logic doesn't properly implement the i4002 behavior, but let's ignore that as it's not related to the use of global clock distribution resources for CLK1 and CLK2.]

The Place:1136 error arises because the synthesis phase has generated a logical AND with clk2 and other signals to gate a latch. The implementation phase then objects to this, as clk2 is on a global clock line and should drive a clock (or gate) input directly rather than through a LUT. The transparent latch primitive LDCE has separate gate and gate enable inputs, so why didn't XST use clk2 as the latch gate and the result of the logical AND as the gate enable? Apparently it's not that smart. Thinking I needed to make it more obvious to the synthesis tool, I rewrote the code to look like this:
    reg     src;
    always @(*) begin
        if (clk2) begin
            if (x22) begin
                src <= cm & (data[3:2] == CHIP_NUMBER);
                if (src) reg_num <= data[1:0];
            end
            if (x32) begin
                if (src) char_num <= data;
            end
        end
    end
While this fixed the problem with the latching of char_num, the problem persisted with src and reg_num, and the tools began reporting the same error for a different signal in a later block. I tried placing the KEEP constraint on the clk1 and clk2 signals to keep the synthesizer from merging these signals with others in a logical operation, but that didn't help.

Wondering if my assumption about transparent latch gates being connected to global clock lines was correct, I decided to try implementing this using LDCE primitives rather than expecting the tools to infer this behavior. I won't post the whole mess, but the latching of src looked something like this:
    wire    src;
    LDCE LD_src (
        .Q      (src),
        .CLR    (1'b0),
        .D      (cm & (data[3:2] == CHIP_NUMBER)),
        .G      (clk2),
        .GE     (x22)
    );  
After tracking down all the places XST wanted to perform logical operations on the clk2 signal in my i4002 emulation, all the errors were resolved. This validates my assumption that the LDCE gate input is treated similarly to the FDCE clock input, in that it can be driven from a global clock line.

While using LDCE primitives also avoids the Xst:737 warning about an inferred latch, it presents other problems. One is that the LDCE primitive handles only one bit. I addressed this by creating modules LDCE2, LDCE4, and LDCE5 to handle the multi-bit cases I encountered.
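
A multi-bit wrapper of this kind can be sketched with a generate loop around the single-bit primitive. The port list below is an assumption chosen to mirror the LDCE pin names; the actual LDCE2, LDCE4, and LDCE5 modules may differ.

```verilog
// Hypothetical 4-bit wrapper around the Xilinx LDCE primitive.
// Port names mirror the primitive's pins; this is a sketch, not the original module.
module LDCE4 (
    output wire [3:0] Q,    // latched outputs
    input  wire [3:0] D,    // data inputs
    input  wire       G,    // latch gate, shared by all bits
    input  wire       GE,   // gate enable, shared by all bits
    input  wire       CLR   // asynchronous clear, shared by all bits
);
    genvar i;
    generate
        for (i = 0; i < 4; i = i + 1) begin : g_bit
            // One single-bit latch per data bit
            LDCE ld (
                .Q   (Q[i]),
                .CLR (CLR),
                .D   (D[i]),
                .G   (G),
                .GE  (GE)
            );
        end
    endgenerate
endmodule
```

Used for the character number, it might look like the fragment below. The gate enable expression is a guess that mirrors the behavioral code shown earlier, where char_num is latched during X3 only if src is set.

```verilog
// Fragment: usage inside the i4002 module (gate enable expression is an assumption)
wire [3:0] char_num;
LDCE4 LD_char_num (
    .Q   (char_num),
    .CLR (1'b0),
    .D   (data),
    .G   (clk2),        // global clock line drives the gate directly
    .GE  (x32 & src)    // qualifying logic moves to the gate enable
);
```
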

A bigger problem, though, is that I found cases in the i4004's schematics where the CLK1 and CLK2 signals are not used only as latch gates, but are involved in logical operations to generate other signals. I don't believe it's worth attempting to decompose those, if it's even possible. In the end I reverted all the changes I made during these experiments.

Realistically, there is no requirement for the CLK1 and CLK2 signals to be routed on the global clock lines. It's a nice idea, but the two-phase design of the MCS-4 chips avoids most of the critical timing issues. And in an FPGA where a LUT can perform complex logical operations in less than half a nanosecond, a 400ns latch gate pulse is an eternity.



One of the documents I found especially useful is "Nonblocking Assignments in Verilog Synthesis, Coding Styles That Kill!" by Clifford E. Cummings of Sunburst Design. This 20-year-old paper addresses many issues related to writing good behavioral code in Verilog. While Cliff Cummings is somewhat opinionated, and I don't agree with some of his ideas on code formatting, I've learned a lot from reading his various papers and presentations.
