Wednesday, July 29, 2020

Running the Busicom software in a Spartan-6

Eight years ago I decided I wanted to learn about programmable logic: PLAs, CPLDs, and FPGAs.

About the same time, I became aware that a group of engineers and computer history buffs had gotten Intel to release the schematics for the first commercially available microprocessor: the Intel 4004 CPU. They'd also retrieved the software that drove the first commercial product that used the i4004: the Busicom 141-PF.

I decided to re-create the i4004 CPU in an FPGA.

It turns out that's a lot like learning to swim by attempting to cross the English Channel. But I've never been known to shy away from a challenge.

Sunday, July 26, 2020

An (unlatched) house of cards

While trying to understand why my latch-based implementation of the Busicom 141-PF calculator didn't work when implemented in a Spartan-6, I tested the i4001 ROM implementation separately. It seemed to work, so I focused on the i4004 CPU.

Now that I have the i4004 CPU switched back to using clocked flip-flops I tried a broader test. I instantiated one CPU, one ROM, and one RAM, and loaded the ROM with the basic functional test that is loaded by default into the i400x analyzer. This test starts with subroutine calls and returns, then checks conditional jumps before testing the ALU functions. My intent was to see whether there were enough flip-flop clock edges per CLK1 or CLK2 pulse for everything to propagate as needed. It's the path through the ALU I'm most worried about.
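The wiring for that broader test looks roughly like the sketch below. Every module and port name here is a placeholder of my own; the actual names in my source tree may differ.

```verilog
// Hypothetical testbench for the one-CPU, one-ROM, one-RAM test.
// Module and port names are placeholders, not the project's real ones.
`timescale 1ns / 1ps

module tb_cpu_rom_ram;
    reg         clk = 0;        // master FPGA clock
    wire        clk1, clk2;     // two-phase clocks generated inside the CPU
    wire        sync;           // SYNC from the CPU to the memories
    wire [3:0]  bus;            // shared 4-bit data bus
    wire        cm_rom, cm_ram; // memory-select strobes

    always #10 clk = ~clk;      // 50 MHz; the exact rate is illustrative

    i4004 cpu (.clk(clk), .clk1(clk1), .clk2(clk2), .sync(sync),
               .bus(bus), .cm_rom(cm_rom), .cm_ram(cm_ram));
    i4001 rom (.clk(clk), .clk1(clk1), .clk2(clk2), .sync(sync),
               .bus(bus), .cm(cm_rom));  // Block RAM preloaded with the test
    i4002 ram (.clk(clk), .clk1(clk1), .clk2(clk2), .sync(sync),
               .bus(bus), .cm(cm_ram));
endmodule
```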

I first started with behavioral simulation, and noted that the first few instructions executed as expected. This gave me confidence to try a Post-P&R simulation, which failed. Unlike the original failure, though, where the address placed on the 4-bit data bus alternated between 000 and 001, the ROM address on the bus appeared correct. However, the address presented to the Block RAM containing the instructions was always zero. This suggested the i4001 ROM emulation wasn't working.

This made no sense to me. I'd tested the latch-based i4001 in both Post-P&R simulation and a real Spartan-6 and it seemed to work just fine, but combined with other modules it appeared to fail.

In my career as a professional software engineer I've sometimes encountered code that, its author claimed, didn't work because of "a bug in the optimizer". Now, I have found a couple of very real compiler bugs, but they're extremely rare in commonly-used compilers. This sort of problem is almost invariably caused by the programmer not understanding subtle details of the language, not by bugs in the compiler.

With this in mind I decided to convert the i4001 back to using edge-clocked flip-flops, something I'd sort of expected to do anyway but hadn't gotten around to.

Of course the thing worked immediately. I think I'm done playing with the latch-based implementation of any of these chips.



As I've said before, the problem I have with the edge-clocked flip-flop version of these chips is that the original design often assumes data can flow through multiple latches during a single CLK1 or CLK2 pulse. Since data can only propagate through a flip-flop on a clock edge, there must be more clock edges during a CLK1 or CLK2 pulse than there are flip-flops in series.

An instruction cycle is divided into eight subcycles using 8-stage shift registers that produce one-hot outputs (meaning only one of the outputs is active at any one time, uniquely identifying the subcycle). The shift register in the i4004 is self-initializing, and produces a SYNC signal output that is used by the shift registers in the i4001 and i4002 to synchronize themselves to the one in the i4004.
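A minimal sketch of such a generator is below. The naming is mine, and which subcycle drives SYNC is an assumption for illustration, not taken from the original schematics.

```verilog
// Sketch of a self-initializing one-hot subcycle generator.
// All names are mine; the SYNC phase choice is an assumption.
module subcycle_gen (
    input  wire       clk,                // master FPGA clock
    input  wire       clk1,               // CLK1; subcycles advance when it goes active
    output reg  [7:0] subcycle = 8'b0,    // one-hot subcycle register
    output wire       sync                // synchronizes the i4001/i4002 copies
);
    reg clk1_q = 1'b0;
    wire clk1_rise = clk1 & ~clk1_q;      // detect CLK1 going active

    always @(posedge clk) begin
        clk1_q <= clk1;
        if (clk1_rise) begin
            // Self-initializing: any state that isn't one-hot (including
            // the all-zeros state after configuration) restarts the cycle.
            if (subcycle == 8'b0 || (subcycle & (subcycle - 1)) != 8'b0)
                subcycle <= 8'b0000_0001;
            else
                subcycle <= {subcycle[6:0], subcycle[7]};  // rotate left
        end
    end

    assign sync = subcycle[7];  // last subcycle marks the cycle boundary
endmodule
```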

Rather than duplicate this critical code in several places I'd extracted both the timing generator and the timing recovery logic into separate Verilog modules. This allows me to test them individually, and use them to test other modules.

One of my concerns was that using edge-clocked flip-flops would result in a one-clock skew in the timing between the generator and the recovery outputs. This would eat into the number of clock edges seen by the flip-flops within a CLK1 or CLK2 "clock" pulse, and result in data not arriving in time or in tristate output driver overlaps.

Subcycle timing signals change in response to CLK1 going active. In my current design, CLK1 (and CLK2) is the output of a flip-flop. The flip-flop that drives CLK1 changes state in response to a rising clock edge, and thus lags the clock edge by about a nanosecond. That means that the flip-flops in the subcycle shift register won't change state, because CLK1 hasn't yet changed when the rising clock edge occurs. Instead, the subcycle shift registers change state on the next rising clock edge, causing a delay of one clock cycle time.

This, in turn, means that the logic that depends on the subcycle timing signal won't change until the rising clock edge after that, or two clock edges after the one that caused the change in CLK1. That's two of my eight rising clock edges consumed already.

[Edit: It's actually only one clock edge. The combinational logic gets the entire period between the CLK1 pulse going active and the next clock edge to update, but the flip-flop doesn't act on the results until that clock edge occurs.]
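The lag can be seen in a toy module like this one (the names are mine, and it compresses the real design down to two registers):

```verilog
// Demonstrates the one-edge lag: CLK1 comes out of a flip-flop, so a
// consumer clocked on the same edge still sees the OLD value of CLK1
// and doesn't react until the following edge.
module clk1_lag_demo (
    input  wire clk,        // master FPGA clock
    input  wire clk1_next,  // combinational "CLK1 should be high" term
    output reg  clk1 = 0,   // registered CLK1; lags clk1_next by one edge
    output reg  gated = 0   // example consumer; lags CLK1 by one more edge
);
    always @(posedge clk) begin
        clk1 <= clk1_next;   // CLK1 changes just after this edge
        if (clk1)            // samples the old CLK1 on this same edge
            gated <= ~gated; // so it toggles one edge after CLK1 rises
    end
endmodule
```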

See why I'm concerned about the possibility of a cycle of skew between the subcycle signals in the i4004 and other chips?

Fortunately, the shift registers in the i4001 and i4002 see the same timing relationship as the shift register in the i4004. Thus the shift registers should stay synchronized with each other.

But after a series of failures I'm done making assumptions. The behavioral simulation looked good, but what about the Post-P&R simulation?


These screen captures show the Post-P&R simulation waveforms. We can clearly see that the generated and recovered subcycle signals are in sync. Zooming in on this shows they change on the same clock edge.


Just to avoid disappointment later, I generated a bit file from this test and loaded it into the Spartan-6 on my P170-DH replacement board. The results shown on my logic analyzer match the Post-P&R simulation. I'd post a screen capture of this too, but with a resolution of 2 ns rather than the simulation's 1 ps, it's actually less interesting than those above.

Tuesday, July 21, 2020

Another flaw in the original i4004?

While trying to count the number of latches a signal might need to pass through during either the CLK1 or CLK2 pulses, I took a look at a signal named by the analyzer "M12+M22+CLK1~(M11+M21)". I determined its function is to gate the internal data bus into the scratchpad register array and instruction pointer array, so I tend to refer to this signal as the "data gate". Other signals determine which, if any, of the array cells are written.

This seemed straightforward enough until I started looking at how this signal is generated. Here's the logic as depicted on the original i4004 schematics:


Sunday, July 19, 2020

More latching mux timing problems

As I mentioned in the previous post, my timing problems seemed to be related to the timing between the output of a mux and the latch intended to capture the output of the mux. Here's a simpler example.

This is the mux in the instruction pointer that determines whether the Effective Address Counter or the Refresh Counter is used to select the active DRAM row. Counting from the top left, the first two signals are the mux selector inputs: the subcycle X12 and X32 signals. The middle pair are the Refresh Counter outputs, and the right-most pair are the Effective Address Counter outputs.

During subcycles X12 and X22, a DRAM row is read and written back unmodified to refresh the row. Thus the Refresh Counter outputs are selected to drive the DRAM row decoder at the beginning of subcycle X12. For the rest of the instruction cycle the Effective Address counter needs to drive the DRAM row decoder, so it is selected at the beginning of subcycle X32.

Rather than have the selected counter actively drive the decoder continuously, this is a latching mux: when the selector signals are inactive, the previously-selected counter outputs are latched.
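In Verilog, that latching mux comes out roughly as follows. The signal names are my guesses from the description above, not the analyzer's.

```verilog
// Sketch of the latching row-address mux; names are my own guesses.
// When neither select is active, row_addr holds its previous value --
// the transparent-latch behavior discussed in these posts.
module row_addr_mux (
    input  wire       x12,       // select the Refresh Counter at X12
    input  wire       x32,       // select the Effective Address Counter at X32
    input  wire [1:0] refresh,   // Refresh Counter outputs
    input  wire [1:0] eff_addr,  // Effective Address Counter outputs
    output reg  [1:0] row_addr   // latched DRAM row select
);
    // Level-sensitive: transparent while a select is high, holding otherwise.
    always @* begin
        if (x12)
            row_addr = refresh;
        else if (x32)
            row_addr = eff_addr;
        // no final else: synthesis infers a latch, matching the original circuit
    end
endmodule
```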

Friday, July 17, 2020

Latch timing failure

I think I found the problem in the i4004 CPU instruction pointer incrementer. And it doesn't bode well for the latch-based implementation.

Here's the problem. The instruction pointer DRAM is configured as four rows of 12 bits, each row representing one of the four instruction pointers. The normal cycle is:
  1. Pre-charge the DRAM column sense lines.
  2. Read all 12 bits of the active IP into a 12-bit register.
  3. Gate the low-order 4 bits onto the data bus.
  4. Add 1 to the low 4 bits, saving the carry out.
  5. Update the low 4 bits of the register.
  6. Gate the middle 4 bits onto the data bus.
  7. Add the carry to the middle 4 bits, saving the carry out.
  8. Update the middle 4 bits of the register.
  9. Gate the high 4 bits onto the data bus.
  10. Add the carry to the high 4 bits.
  11. Update the high 4 bits of the register.
  12. Write the 12-bit register back to the active IP.
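Steps 3 through 11 amount to a nibble-serial incrementer. Here's a rough sketch of that portion, with invented names, an edge-clocked register in place of the latches, and the data-bus gating omitted:

```verilog
// Sketch of the nibble-serial IP increment (steps 3-11); my reading of
// the cycle with invented names, not the actual i4004 netlist.
module ip_increment (
    input  wire        clk,        // master FPGA clock
    input  wire [1:0]  nibble_sel, // 0 = low, 1 = middle, 2 = high nibble
    input  wire        wr_en,      // write-enable for the selected nibble
    output reg  [11:0] ip_reg = 0  // 12-bit temporary register
);
    reg carry = 1'b0;

    // Gate the selected nibble to the adder (indexed part-select).
    wire [3:0] nibble = ip_reg[nibble_sel * 4 +: 4];

    // The low nibble adds 1; the middle and high nibbles add the saved carry.
    wire [4:0] sum = nibble + ((nibble_sel == 2'd0) ? 4'd1 : {3'b000, carry});

    always @(posedge clk) begin
        if (wr_en) begin
            ip_reg[nibble_sel * 4 +: 4] <= sum[3:0];  // update the nibble
            carry <= sum[4];                          // save the carry out
        end
    end
endmodule
```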
Here's what Step 5 looks like in behavioral simulation. On the bottom we have the least significant bit output of the adder. The short (400ns) pulse in the middle is the write enable for the low 4 bits of the 12-bit register. Above that is the LSB of the 12-bit register itself.

Next let's look at the post-map simulation with the same arrangement of signals. The LSB of the adder becomes a 1 much later than in the behavioral simulation, and goes back to a 0 much sooner. How much sooner?

Here's the same post-map simulation, zoomed in at the falling edge of the write-enable. The adder output falls 25 ps (that's picoseconds: 0.000000000025 seconds) before the gate enable goes inactive, but that's long enough for the latch to capture a zero rather than a one. Bummer.

The problem appears to be in the way I've coded the 12-bit temporary register. This register (really charge stored on the gates of MOSFET inverters in a real i4004 CPU) is written from three non-overlapping sources. Because of the way the circuitry is implemented, any of these three can set the register content without conflict.

In an FPGA, this is implemented using a mux to select an input source and a storage element to retain the value. Since I coded the input selection logic as implemented in the real i4004 CPU, the mux selectors and the latch gate are driven by the same signals. In this case, though, the mux output is changing 25 picoseconds before the latch gate has gone inactive, and the latch captures the wrong value. This is why most modern logic uses clocked flip-flops rather than latches.

The challenge I face is to separate the mux selectors from the latch gates such that the mux outputs are stable before and after the latch is enabled. This is turning out to be non-trivial.
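One direction, sketched here with made-up names, is to keep the mux select logic as-is but replace the transparent latch with an edge-clocked register, so the value is sampled on a clock edge well inside the gate pulse rather than tracked right up to its falling edge:

```verilog
// Sketch: capture the mux output with an edge-clocked register instead of
// a transparent latch. Names are mine, not the project's.
module mux_capture (
    input  wire       clk,      // master FPGA clock
    input  wire       gate,     // former latch gate, reused as a clock enable
    input  wire [3:0] mux_out,  // mux output, driven by the same select signals
    output reg  [3:0] value = 0 // captured value
);
    always @(posedge clk)
        if (gate)
            value <= mux_out;   // sampled mid-gate, not at the closing edge
endmodule
```

The trade-off is the one described throughout these posts: every such conversion consumes a clock edge, and there are only so many edges per CLK1 or CLK2 pulse to go around.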

Thursday, July 16, 2020

Xilinx ISE iSim simulation modes

I've been doing some research into the various levels of simulation available using the Xilinx ISE toolchain and simulator. All of my simulation to this point had been at the behavioral level. I knew it was possible to simulate ASICs that had been designed using Verilog at the component level, but I'd never had need to look into this as behavioral simulation had been sufficient.

My research showed that using the Xilinx ISE toolchain, a design can be simulated at any of five stages along the way from HDL to loadable bit file:

Wednesday, July 15, 2020

Fixing combinational loops

As part of the conversion from edge-clocked flip-flops to transparent data latches I re-coded my i4004 counter module. This resulted in several warnings about combinational loops, but those warnings were buried in a sea of warnings about my use of data latches. It seemed to work well enough when implemented for the Spartan-3E, and I assumed it would work well enough on the Spartan-6.
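For anyone wondering what draws that warning, here's the kind of construct I mean, reduced to a toy (the module is my own illustration, not code from the counter):

```verilog
// A "latch" built from a continuous assignment that feeds back on itself.
// Synthesis flags this as a combinational loop.
module loopy_latch (
    input  wire d,
    input  wire en,
    output wire q
);
    // q is read to compute q. On an ASIC this style can be a legitimate
    // storage node; in an FPGA fabric its behavior is unpredictable.
    assign q = en ? d : q;
endmodule
```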

I was wrong.