Saturday, October 1, 2022

Surprising BBL behavior

I had an interesting conversation with Jochen this week regarding a bit of code he found on page 98 of the Intel MCS-4 User's Manual. This is a short program intended to test the operation of the MCS-4 Evaluation Kit using the 4009-0009 interface chip. The program begins as shown here:

As a programmer, this makes little sense. The BBL (Branch Back and Load Accumulator) instruction is intended to return from a subroutine call. Yet here it's executed without a previous subroutine call (JMS) instruction being executed. Jochen asked, wouldn't this result in an infinite loop or other erroneous behavior?

My first reaction was to agree with him. Surely this must be a typographical error? After thinking about it more carefully, I realized the answer to that is no. This code has a deterministic and defined behavior. To understand why, it's necessary to look a bit more closely at the design of the Intel 4004 CPU.

First, it's important to realize that the 4004 CPU does not have a call stack as commonly implemented in most modern CPUs. Most commonly, a processor register usually referred to as the stack pointer (SP) points to an area of RAM that is used to save the program counter (PC) when a subroutine is called. As part of the execution of the subroutine call instruction, the SP is updated to point to the next unused memory location, usually by decrementing the SP, and the address of the instruction following the call is written to the memory location pointed to by the SP. A subroutine return instruction works by loading the PC from the memory location pointed to by SP, then SP is incremented. In this way multiple subroutine calls can be nested, and the execution flow returns to the calling routines as expected.

Despite references to a call stack in the documentation, the i4004 CPU does not have a stack. Instead, it has four independent program counters. A 2-bit counter called the Effective Address Counter (EAC) determines which of the four PCs is used as the current PC. Like most 2-bit binary counters, incrementing the EAC when it contains the value 3 rolls it over to 0.

During the CPU reset, all of the four program counters are reset to zero. The EAC is not reset, as all four possible states are equally valid; for the purposes of illustration, let's assume it starts with the value 3.

In the i4004 CPU, an instruction execution cycle consists of a sequence of eight smaller phases. For the purposes of this discussion, we can reduce these to three steps, each consisting of two or three phases:

  1. Addressing: the current PC is sent to the ROM, and the PC is incremented (A1, A2, A3).
  2. Memory Read: The instruction is read from the ROM (M1, M2).
  3. Execution: The instruction is decoded and executed (X1, X2, X3).

The subroutine call instruction (JMS) consists of two instruction cycles. The first fetches and decodes the JMS instruction itself. The second cycle fetches the address of the first instruction in the subroutine from ROM, increments the EAC to select a new PC, and loads the subroutine address into the newly selected PC.

The subroutine return instruction (BBL) functions by decrementing the EAC. (Actually, it increments the EAC three times, but the effect is the same). Critically for this discussion, the EAC is altered during the execution phases, after the current PC has been incremented.

To illustrate this, let's look at the sequence of execution of the instructions in the example, showing the state of the EAC and each of the four PCs during each instruction cycle. I've highlighted the active PC at the start of the instruction cycle in bold digits:

Cycle  EAC   PC0   PC1   PC2   PC3  Instruction
1 3 0 0 0 0 WRR
2 3 0 0 0 1 BBL
3 2 0 0 0 2 WRR
4 2 0 0 1 2 BBL
5 1 0 0 2 2 WRR
6 1 0 1 2 2 BBL
7 0 0 2 2 2 WRR
8 0 1 2 2 2 BBL
9 3 2 2 2 2 FIM (code)
10 3 2 2 2 3 FIM (data)

This shows clearly (or as clearly as I can manage) how this instruction sequence works.

In my opinion, this is really ugly code. In normal programming practice this sort of trickery should be avoided if at all possible. However, this was more common in the era when a 256 x 8-bit ROM was large. The example program uses every last byte available in a 4001 ROM chip.

Thanks to Jochen for giving me a fun mystery to chase down!

No comments:

Post a Comment