The DIY Mark I CPU
Every once in a while I get the urge to reminisce about a past project. So, today let’s look back on one of my favorite projects from my undergrad computer engineering degree, or as I like to call it:
The Time When I Built a Custom CPU Because Writing Software Was Easier Than Designing Hardware
One of my lab courses focused on building a sequence of group projects with ever increasing complexity. Most of them ended up being games with a focus on an FPGA development board, a serial connection, and some custom software running on a desktop computer. The individual projects that have stuck around in my mind are:
- A GAL (Generic Array Logic) and discrete component based traffic light control system with a pedestrian button.
- An Etch A Sketch application where the FPGA acted as a hardware controller and the desktop computer translated move operations and whether the cursor was depressed into a displayed image.
- A game of hangman where the hardware system read key events from a PS/2 keyboard and then submitted full words as guesses to the computer for updating the display.
- A signal capture tool using an ADC/DAC set of chips with an on-desktop visualizer.
- A student selected project, which for my team was a two player game of pong displayed on an analog oscilloscope.
The whole sequence of projects started out with a GAL plus discrete logic based assignment. We were tasked with building up a system for a standard 4-way intersection with a set of pedestrian crossings and an associated pedestrian crossing request button.
We had lookup tables in the GAL, counters, timers, logic to decode our state into which lights were on, binary to 7 segment decoder chips, etc. It was a lot. What we needed most was some more rigor in our wiring job, but we managed to complete the task. The project was a combination of loose planning and organically patching any mistakes we had made while watching the system through a logic squid. It was remarkable how much physical devices slowed down development.
The subsequent projects took us one step away from the 80s style of physically building out logic and into the era of defining our systems in a hardware description language, namely VHDL. This sped things up considerably, though the tooling available to students was subpar at the time (or at least that was the case at my university), so I wanted to make another technological leap and move into software. That required us to have a CPU of some sort, and the assignments were always about turning in the hardware description of a system, so it couldn’t be imported from an external project like OpenCores. I knew from my tinkering in open source software that the later projects would either take less time or be more fun with a basic CPU that could be reused from project to project, so I set off and made one. Admittedly that meant creating something bespoke with no online resources to lean on, but at the time resources for VHDL were *sparse* anyway.
The classes focused around state machines, so how much harder could it be to build up a programmable state machine? A computer or in our case a very very terrible minimal computer was born. The CPU itself was copied between projects with near zero changes and just enough custom peripherals were created per assignment to get the system over the finish line.
System Overview
The DE1 development board we worked with has a bunch of LEDs, some switches, a few momentary buttons, some RAM, and a few electrical connections to the rest of the world in the form of a serial port, a PS2 keyboard connection, and some general purpose IO pins. There are some other bells and whistles, but they aren’t particularly relevant.
The custom Mark I CPU was built to handle the orchestration between these peripherals and was as simple as you could reasonably get. At the time I was often tinkering with some of the AVR processors and I enjoyed the simplicity of an 8-bit system, so I made the Mark I a fully 8-bit system: 8 bits of addressable memory, 8 bits of addressable program ROM, and only four registers of 8 bits or less. By the last project the instruction set had grown to 22 opcodes and it was running at a blazing fast 50 MHz. It was certainly stripped down, with no interrupts, no relative jumps, no pipelines, and no variable cycle instructions, but it worked well enough for the course.
Architecture
As mentioned before, the CPU was thoroughly an 8-bit system: 8-bit addressable memory, 8-bit program ROM, 8-bit operands, and four registers of 8 bits or less. Those registers were:
- The status register (SR)
- The general purpose register (REG)
- The indexing register (IDX)
- The program counter (PC)
SR stores system flags such as the is-equal flag, the is-zero flag, and other application specific flags. REG is used for general loading from memory, arithmetic operations, comparisons, and temporary storage. IDX is used for stack management, indirect memory access, and iteration. PC tracks the current position in the system ROM, which determines the current instruction; in other systems this is referred to as an Instruction Pointer (IP).
The instruction set is summarized below:
Mnemonic | Arg. | PC | SR | REG | IDX | ALU | Bus | Cycles | Notes |
---|---|---|---|---|---|---|---|---|---|
read | Ar | . | . | W | . | i.. | R | 2 | $REG := (Ar) |
load | V | . | . | W | . | i.. | . | 1 | $REG := V |
send | Ar | . | . | R | . | i.. | W | 1 | (Ar) := $REG |
test | V | . | W | . | . | i.. | . | 1 | $SR ⇐ $REG==V |
jsrr | . | W | . | R | . | ... | . | 1 | $PC := $REG |
goto | Ap | W | . | . | . | ... | . | 1 | $PC := Ap |
brnz | Ap | W | R | . | . | i.. | . | 1 | $PC := Ap if $REG!=0 |
brzz | Ap | W | R | . | . | i.. | . | 1 | $PC := Ap if $REG==0 |
breq | Ap | W | R | . | . | i.. | . | 1 | $PC := Ap if $SR.equal is true |
brc0 | Ap | W | R | . | . | i.. | . | 1 | $PC := Ap if control 0 is true |
brc1 | Ap | W | R | . | . | i.. | . | 1 | $PC := Ap if control 1 is true |
addd | V | . | . | RW | . | i.F | . | 1 | $REG := $REG + V |
subx | V | . | . | RW | . | i.F | . | 1 | $REG := $REG - V |
spsh | V | . | . | . | RW | ii. | W | 1 | ($IDX++) := V; push V to stack |
rpsh | . | . | . | R | RW | ii. | W | 1 | ($IDX++) := $REG; push $REG to stack |
spop | . | . | . | W | RW | id. | R | 2 | $REG := (--$IDX); pop value to $REG |
lspt | V | . | . | . | W | i.. | . | 1 | $IDX := V; load stack pointer |
sidx | V | . | . | . | W | i.. | . | 1 | $IDX := V; set index |
iidx | . | . | . | . | RW | ii. | . | 1 | $IDX := $IDX+1 |
lidx | . | . | . | W | R | i.. | R | 2 | $REG := ($IDX) |
swap | . | . | . | RW | RW | i.. | . | 1 | swap($REG,$IDX) |
trap | . | W | W | W | W | ... | . | 1 | Reset system |
noop | . | . | . | . | . | i.. | . | 1 | Do nothing |
Legend: R(ead), W(rite), .(nothing), V(alue), Ar(Address/RAM), Ap(Address/Program). The ALU column holds one character each for PC, IDX, and REG, in that order: i(ncrement ALU), d(ecrement ALU), F(ull ALU).
In total, 22 opcodes were in the final VHDL version of the CPU that I was able to locate (technically 23, but lspt and sidx are identical).
No opcode encoding work was done; that was left to the VHDL generation tools, but opcodes could easily be 8 bits to fit with the overall theme of the project.
One challenge of working with this instruction set was the lack of relative jumps, which meant that a list of destination addresses needed to be maintained. Most code modifications were painful if they shifted code by even one address, so the Mark I CPU even has an assembler, but I’ll get to that a bit later.
Peripherals
You might notice that the CPU only interacts with registers, the 256 bytes of memory, and a few control lines with those branch instructions. So how does it work with the rest of the IO world?
Memory mapped peripherals.
Similar to embedded microcontrollers, if the CPU reads or writes a given 'magic' address it is actually reading or writing values in a peripheral that’s wired onto the main bus. Heck, even memory is a peripheral in our case. For most of the projects this CPU was used in we were required to use the SRAM chip on the DE1 board.
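To make the idea concrete, here is a minimal C sketch of how a bus access might be decoded. The addresses for the LEDs, UART, and ADC are borrowed from the example program later in this post; everything else (the `bus_t` struct, the `bus_write`/`bus_read` names, and treating every other address as plain RAM) is an assumption for illustration and not a translation of the original VHDL.

```c
#include <stdint.h>

/* Addresses borrowed from the example program's .equ constants. */
enum { ADDR_LEDS = 0x01, ADDR_UART_TX = 0x02, ADDR_ADC = 0x06 };

/* Hypothetical model of everything hanging off the 8-bit bus. */
typedef struct {
    uint8_t ram[256];    /* general purpose memory */
    uint8_t leds;        /* last value latched on the LEDs */
    uint8_t uart_tx;     /* last byte handed to the UART */
    uint8_t adc_sample;  /* value the ADC would currently report */
} bus_t;

/* A write to a 'magic' address lands in a peripheral instead of RAM. */
void bus_write(bus_t *bus, uint8_t addr, uint8_t value)
{
    switch (addr) {
    case ADDR_LEDS:    bus->leds = value;      break;
    case ADDR_UART_TX: bus->uart_tx = value;   break;
    case ADDR_ADC:     /* read only: ignore */ break;
    default:           bus->ram[addr] = value; break;
    }
}

/* A read from a 'magic' address is answered by the peripheral. */
uint8_t bus_read(const bus_t *bus, uint8_t addr)
{
    switch (addr) {
    case ADDR_ADC:     return bus->adc_sample;
    case ADDR_LEDS:
    case ADDR_UART_TX: return 0;  /* write only: nothing useful to read */
    default:           return bus->ram[addr];
    }
}
```

With this model, the `read ADC` / `send LEDS` pair from the example program amounts to `bus_write(&bus, ADDR_LEDS, bus_read(&bus, ADDR_ADC))`; the CPU itself never needs to know which addresses are special.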
Since I’ve lost some of the code used with the Mark I based CPU projects I don’t have an exhaustive list, but I can find evidence of memory mapped peripherals for:
- SRAM : general purpose memory
- A debug interface : manually stepping the CPU and visualizing the address/data lines
- A timer : for accurately and regularly tending to tasks. This peripheral had to be polled since no interrupts exist in this CPU
- A UART : for communicating with computers over a serial link
- Beeper : for making sounds when a player has done an action in games
- A vector graphics video driver with selectable VROM banks : for rendering graphics on X/Y mode oscilloscope outputs
- A PS2 keyboard driver : for typing in commands
- A character LCD driver : for displaying what users have typed
- A seven segment display driver : typically for visualizing debug information
- User input buttons : for binary user inputs or for stepping through the code at low speeds (it’s pretty hard to visualize things by eye at 50 MHz)
If I attempt to build another CPU in the future I expect a huge portion of the time will focus on the debug interface. I personally find them delightful. For a particularly good set of examples take a look at the Magic processor’s front panel or the currently available PDP-11 replica front panels by Obsolescence Guaranteed. In fact the user interface component is a delightful space to explore for complex systems whether that’s processors, synths, control systems, or some other complex system that needs to precisely communicate information to a skilled user.
Assembler & Examples
As mentioned before, the processor does not have relative jumps which makes program modifications tedious if we don’t have an assembler. The assembler itself is pretty trivial with most of the point being tracking the addresses of assembled instructions. In fact almost all of the meaningful bits of it can be seen in these yacc/lex snippets:
program:
      program statement '\n'
    |
    ;

statement:
      express
    | label express
    | label
    | directive
    |
    ;

directive:
      EQU SYMBOL NUMBER {$2->val=$3;}
    | START             {pass++;pc=0;}
    | SRC SYMBOL        {if(pass>1) printf("--%s\n",$2->name);}
    ;

express:
      OPCODE            {line($1, 0x00);}
    | OPCODE SYMBOL     {line($1, $2->val);}
    | OPCODE NUMBER     {line($1, $2);}
    ;

label:
      SYMBOL ':'        {$1->val=pc;}
    ;

;.* /*Ignore comments*/
[ ]+ /*Ignore whitespace*/
iidx|read|test|swap|spop yylval.string=strdup(yytext); return OPCODE;
load|send|nopp|goto|subx yylval.string=strdup(yytext); return OPCODE;
rpsh|rpop|spsh|trap yylval.string=strdup(yytext); return OPCODE;
addd|brc[0-2]|jsrr|breq|sidx yylval.string=strdup(yytext); return OPCODE;
brnz|brzz|lidx yylval.string=strdup(yytext); return OPCODE;
#[0-9a-fA-F]+ sscanf(yytext+1,"%X",&yylval); return NUMBER;
%[0-9a-zA-Z] yylval.number = (char) yytext[1]; return NUMBER;
[_a-zA-Z]+ yylval.sym=intern(yytext); return SYMBOL;
^\.start return START;
^\.equ return EQU;
^\.src return SRC;
: return *yytext;
\n return *yytext;
A program consists of a sequence of statements. Those statements are either labels for code to jump to, opcodes with optional arguments, definitions of constants, or definitions of functions which are printed as comments in the generated code. Easy enough for something as low level as an assembler, right?
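The address tracking is the part that makes absolute jumps bearable, so here is a small C sketch of the idea: one pass records the ROM address of every label and a second pass substitutes those addresses for symbolic arguments. This is not the original yacc/lex assembler. The `stmt_t` and `symbol_t` types and the `collect_labels`/`assemble` functions are hypothetical, and the sketch assumes every instruction occupies exactly one ROM address and that a label is always attached to an instruction.

```c
#include <stdint.h>
#include <string.h>

/* One statement: optional label, mnemonic, and a symbolic or numeric argument. */
typedef struct {
    const char *label;   /* NULL if the line has no label */
    const char *opcode;  /* four letter mnemonic (not interpreted here) */
    const char *target;  /* symbol used as the argument, or NULL */
    uint8_t     value;   /* numeric argument when target is NULL */
} stmt_t;

typedef struct { const char *name; uint8_t addr; } symbol_t;

/* Pass 1: a label's value is the address of the statement it sits on. */
static int collect_labels(const stmt_t *prog, int n, symbol_t *syms)
{
    int count = 0;
    for (int pc = 0; pc < n; pc++) {
        if (prog[pc].label) {
            syms[count].name = prog[pc].label;
            syms[count].addr = (uint8_t)pc;
            count++;
        }
    }
    return count;
}

/* Pass 2: resolve symbolic arguments into operands.
 * Returns 0 on success, -1 on an undefined symbol or oversized program. */
int assemble(const stmt_t *prog, int n, uint8_t *operands)
{
    symbol_t syms[256];
    int nsyms;

    if (n > 256)
        return -1;  /* only 8 bits of program address space */
    nsyms = collect_labels(prog, n, syms);

    for (int pc = 0; pc < n; pc++) {
        int found = 0;
        operands[pc] = prog[pc].value;
        if (!prog[pc].target)
            continue;
        for (int s = 0; s < nsyms; s++) {
            if (strcmp(syms[s].name, prog[pc].target) == 0) {
                operands[pc] = syms[s].addr;
                found = 1;
                break;
            }
        }
        if (!found)
            return -1;
    }
    return 0;
}
```

Because the label addresses are recomputed on every run, inserting a single instruction in the middle of a program shifts every later jump target automatically instead of requiring a hand-maintained address list.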
So, what does a real program look like? Well, let’s look at a slightly modified assembly project. For context, this is from the project where a collection of analog values is read from an external chip, then transmitted over a serial connection to be displayed on a PC.
To start off, let’s look at some constants:
;Define constants
.equ UART_TX #02
.equ LEDS #01
.equ ADC #06
.equ STACKP #21
.equ STACK_H #F0
.equ BINTOASCII #08
.equ ASCII_H #09
.equ ASCII_L #08
.equ TMP_SEND #A2
;brc0 is uart ready send
In this program the CPU has a few system peripherals:
- A set of on board LEDs to display running values of our input, memory mapped at 0x01 (write only)
- The serial link to the computer (UART), memory mapped to address 0x02 (write only)
- The analog to digital chip, mapped at 0x06 (read only)
- A binary to ascii-hex conversion ROM, mapped at 0x08 (input) and 0x08..0x09 (output)
In addition, the stack used to store the data which is sent to the PC runs from 0x21 up to 0xF0. After the leading 'M' marker byte that leaves room for 103 8-bit values, each stored as two ASCII hex characters, before the newline and null terminator are appended.
There is also a comment indicating that the UART is connected to the CPU’s control line 0 to signal when data can be sent.
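The 103 figure can be checked with a bit of arithmetic. The C sketch below assumes, based on the program's init and input routines, that one byte at STACKP holds the 'M' marker, that each sample takes two bytes, and that collection stops once the index register equals STACK_H; the function names are made up for this example.

```c
#include <stdint.h>

enum { STACKP = 0x21, STACK_H = 0xF0 };  /* values from the .equ lines */

/* Samples collected before the index register reaches STACK_H. */
int samples_per_frame(void)
{
    int first_free = STACKP + 1;        /* slot after the 'M' marker */
    return (STACK_H - first_free) / 2;  /* two hex characters per sample */
}

/* Bytes sent over the UART per frame: marker, hex characters, newline.
 * The null terminator ends the send loop and is not transmitted. */
int bytes_per_frame(void)
{
    return 1 + 2 * samples_per_frame() + 1;
}
```

Under those assumptions `samples_per_frame()` works out to (0xF0 - 0x22) / 2 = 103, and each frame on the serial link is 208 bytes long.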
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
.src INIT_ROUTINE
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
init:
sidx STACKP ;config stack
spsh #4D ;'M'
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
.src MAIN_ROUTINE
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
main:
goto inputs
in_ret: goto outputs
out_ret: goto delay
del_ret: goto main
The beginning of the program initializes the array/stack used to store the data which will be transmitted, and then it establishes the main program flow: inputs are collected, the collected input is output to the computer, the system waits, and then the process repeats.
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
.src INPUT_ROUTINE
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
inputs:
read ADC ;Get analog value
nopp
send BINTOASCII ;Convert to ascii
send LEDS
read ASCII_H
nopp
rpsh ;Push values onto stack (char*)
nopp
read ASCII_L
nopp
rpsh
swap
test STACK_H ;Check for end of loop
swap
breq send_data
send_ret: goto in_ret
Ignore the no-operation instructions here, but as the comments indicate, data is read from the ADC peripheral. The data is converted into two bytes of hex-ascii (e.g. the two characters for 0xfa), then stored on the stack. At the end we check to see if the stack pointer is at its maximal value with the test instruction, and if it is at that value we go to the next stage; otherwise we keep gathering more input.
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
.src SEND_ROUTINE
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
send_data:
spsh #0A ;Add new line
spsh #00 ;Add null terminator
sidx STACKP ;Set start of char*
send_loop:
lidx ;Read in next value
nopp
brzz send_exit ;Check for null terminator
send TMP_SEND
wait: brc0 do_send
;do something useful while waiting
read ADC
nopp
send LEDS
goto wait
do_send: read TMP_SEND
nopp
send UART_TX ;Transmit data
iidx
goto send_loop
send_exit:
sidx STACKP ;reset stack
spsh #4D ;'M'
goto in_ret
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
.src OUTPUT_ROUTINE
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
outputs: goto out_ret
Now with the output we’re getting to the point where we have something more complex. In the first 3 instructions we establish that we have a null terminated string starting at the constant address STACKP. We now want to send this string byte by byte to the UART, which is done by the send_loop. Before each character is sent we have to wait for the UART to be ready for the next byte, so we stay in the wait loop until control line 0 indicates we can send the next character. When we do send the data it is moved from the TMP_SEND address to the UART_TX memory mapped device, and we repeat the process until the full string has been sent.
Rewritten in C this looks like:
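#include <stdint.h>

#define STACKP ((uint8_t *)0x21)    /* start of the transmit buffer */

uint8_t *stack_pos;
volatile uint8_t *led;
volatile uint8_t *adc;
volatile uint8_t *uart;

int uart_busy(void);                /* stand-in for polling control line 0 */

void send_data(void)
{
    uint8_t chr;

    *stack_pos++ = '\n';            /* add new line */
    *stack_pos++ = '\0';            /* add null terminator */
    stack_pos = STACKP;             /* rewind to the start of the string */
    while (1) {
        chr = *stack_pos++;
        if (chr == 0)               /* stop at the null terminator */
            break;
        while (uart_busy())
            *led = *adc;            /* do something useful while waiting */
        *uart = chr;                /* transmit data */
    }
    stack_pos = STACKP;             /* reset stack */
    *stack_pos++ = 'M';
}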
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
.src DELAY_ROUTINE
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
delay: load #00
delay_loop: addd #FF
nopp
nopp
brnz delay_loop
goto del_ret
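The delay routine is the last one. It loads 0 and then repeatedly adds 0xFF (which is the same as subtracting 1 modulo 256) until the register wraps back around to 0, so the loop body runs 256 times. We can calculate how long this should take to execute: the processor runs at 50 MHz with one instruction per cycle and one loop is 4 instructions (add + 2x no-op + branch). That gives 256*4+1+1 cycles (the extra two are the initial load and final goto), or 1026 cycles, so roughly 20 microseconds. Not much time at all.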
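The cycle count can be double checked with a tiny emulator. The sketch below models only the opcodes the delay routine uses; the `insn_t` encoding, the enum names, the `DEL_RET` stand-in address, and the helper functions are all hypothetical (the original left opcode encoding to the VHDL tools), and it assumes one cycle per instruction for these opcodes.

```c
#include <stdint.h>

typedef enum { OP_LOAD, OP_ADDD, OP_NOPP, OP_BRNZ, OP_GOTO } op_t;
typedef struct { op_t op; uint8_t arg; } insn_t;

enum { DEL_RET = 0xFF };  /* stand-in address for the del_ret label */

static const insn_t delay_rom[] = {
    { OP_LOAD, 0x00 },     /* delay:      load #00        */
    { OP_ADDD, 0xFF },     /* delay_loop: addd #FF        */
    { OP_NOPP, 0x00 },     /*             nopp            */
    { OP_NOPP, 0x00 },     /*             nopp            */
    { OP_BRNZ, 0x01 },     /*             brnz delay_loop */
    { OP_GOTO, DEL_RET },  /*             goto del_ret    */
};

/* Runs the routine until it jumps to del_ret; returns cycles consumed. */
unsigned run_delay(void)
{
    uint8_t pc = 0, reg = 0;
    unsigned cycles = 0;

    while (pc != DEL_RET) {
        insn_t in = delay_rom[pc];
        pc++;       /* PC increments unless a jump overrides it */
        cycles++;   /* assumed: one cycle per instruction */
        switch (in.op) {
        case OP_LOAD: reg = in.arg;                   break;
        case OP_ADDD: reg = (uint8_t)(reg + in.arg);  break;  /* wraps at 8 bits */
        case OP_NOPP:                                 break;
        case OP_BRNZ: if (reg != 0) pc = in.arg;      break;
        case OP_GOTO: pc = in.arg;                    break;
        }
    }
    return cycles;
}

/* At 50 MHz there are 50 clock cycles per microsecond. */
double cycles_to_us(unsigned cycles)
{
    return cycles / 50.0;
}
```

Running `run_delay()` gives 1026 cycles, which `cycles_to_us` turns into 20.52 microseconds, in line with the estimate above.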
Quirks
The processor had its fair share of quirks and general limitations. While this is by no means a comprehensive list, a few were pain points at the time:
- It ran at 50 MHz with all actions on the rising edge of the clock and nothing on the falling edge. The speed was higher than was needed, and the lack of falling edge operations resulted in several quirks around reads and writes bleeding into a 2 cycle instruction.
- No secondary register, so manually allocating the single register to various tasks was fiddly.
- No function call/return opcodes, so subroutine calls were verbose operations where the user had to save their variables before placing a return address on the stack, and then make sure the called code ended by popping a value from the stack and unconditionally jumping to the current register value.
- Read is a multi-tick operation and multi-tick instructions do not exist, so no-op instructions were manually inserted.
- Occasional bugs where on some revisions jumps would occur before the program counter was incremented.
- 256 instructions really isn’t a lot. For basic problems, sure, that’s loads, and when you can write arbitrary peripherals it’s workable, but I did run out of program space more than once in the development of the final project.
Looking back
It’s pretty amazing looking back at the specs for the DE1 board. Those boards are still sold, but in an updated revision. The newer revision is a heck of a lot more powerful than the original. Reading the spec sheet of the individual FPGA on my board I’ve got 18,752 logic elements, and the new FPGA simply specifies 77k logic elements, which gives me the sense that students aren’t running into those situations where they’ve exhausted every single logic unit on one of these boards while doing classwork. The newer boards don’t even need the effort of building a custom CPU since they’ve got a built in ARM core. Heck, the ARM core has access to a full GB of RAM, which is remarkable compared to the 8MB of SDRAM on my board and the 256 bytes of address space of the Mark I CPU.
The Mark I CPU worked. It did its job, and while I haven’t accomplished my personal goal of making a CPU from raw transistors, it got me a decent part of the way there. If I do go all the way to transistors I’m guessing I’ll go down to 4 bits or have the PCB fab assemble the ocean of transistors. Overall it’s a fun space and simple instruction sets are really interesting. I don’t play around with them as much as I used to, but it is a delight to read through ISAs like the Z80 or AVR. There’s a beauty to them that is lost in the pragmatic full instruction sets you see in x86 and ARM.
So, I’ll leave this article here for now. I may well get back to it and edit it some more or extend it, but this has been sitting in my drafts folder long enough that I should put it up on my website before it collects another layer of dust.