Every once in a while I get the urge to reminisce about a past project. So, today let’s look back on one of my favorite projects from my undergrad computer engineering degree, or as I like to call it:

The Time When I Built a Custom CPU Because Writing Software Was Easier Than Designing Hardware

One of my lab courses focused on building a sequence of group projects with ever-increasing complexity. Most of them ended up being games built around an FPGA development board, a serial connection, and some custom software running on a desktop computer. The individual projects that have stuck around in my mind are:

  1. A GAL (Generic Array Logic) and discrete component based traffic light control system with a pedestrian button.

  2. An Etch A Sketch application where the FPGA acted as a hardware controller and the desktop computer translated move operations, and whether the cursor was depressed, into a displayed image.

  3. A game of hangman where the hardware system read key events from a PS/2 keyboard and then submitted full words as guesses to the computer for updating the display.

  4. A signal capture tool using an ADC/DAC set of chips with a desktop visualizer.

  5. A student selected project, which for my team was a two player game of pong displayed on an analog oscilloscope.

2024 04 gal traffic light

The whole sequence of projects started out with a GAL+discrete logic based assignment. We were tasked with building up a system for a standard 4 way intersection with a set of pedestrian crossings and an associated pedestrian crossing request button.

2024 04 traffic

We had lookup tables in the GAL, counters, timers, logic to decode our state into which lights were on, binary to 7 segment decoder chips, etc. It was a lot. As you might be able to see, what we needed most was more rigor in our wiring job, but we managed to complete the task. The project was a combination of loosely planning things and organically patching any mistakes we had made while watching the system through a logic squid. It was remarkable how much physical devices slowed down development.

The subsequent projects took us one step away from the 80s style of physically building out logic and into the era of defining our systems in a hardware description language, namely VHDL. This sped things up considerably, though the tooling available to students was subpar at the time (or at least that was the case at my university), so I wanted to make another technological leap and move into software. That required us to have a CPU of some sort, and the assignments were always about turning in the hardware description of a system, so it couldn’t be imported from an external project like OpenCores. I knew from my tinkering in open source software that the later projects would either take less time or be more fun with a basic CPU that could be reused from project to project, so I set off and made one. Admittedly this meant creating something bespoke without online resources to lean on, but at the time resources for VHDL were *sparse* anyway.

The classes focused around state machines, so how much harder could it be to build up a programmable state machine? A computer or in our case a very very terrible minimal computer was born. The CPU itself was copied between projects with near zero changes and just enough custom peripherals were created per assignment to get the system over the finish line.

System Overview

2024 04 DE1

The DE1 development board we worked with has a bunch of LEDs, some switches, a few momentary buttons, some RAM, and a few electrical connections to the rest of the world in the form of a serial port, a PS2 keyboard connection, and some general purpose IO pins. There are some other bells and whistles, but they aren’t particularly relevant.

The custom Mark I CPU was built to handle the orchestration between these peripherals and was as simple as you could reasonably get. At the time I was often tinkering around with some of the AVR processors and I enjoyed dealing with the simplicity of an 8 bit system, so I made the Mark I a fully 8 bit system: 8 bits of addressable memory, 8 bits of addressable program ROM, and only four registers of 8 bits or less. By the last project the instruction set had grown to 22 opcodes and it was running at a blazing fast 50MHz. It was certainly stripped down with no interrupts, no relative jumps, no pipelines, and no variable cycle instructions, but it worked well enough for the course.

Architecture

As mentioned before, the CPU was thoroughly an 8-bit system: 8-bit addressable memory, 8-bit program ROMs, 8-bit operands, and four registers of 8 bits or less. Those registers were:

  • The Status register (SR)

  • The general purpose register (REG)

  • The indexing register (IDX)

  • The program counter (PC)

SR stores system flags such as the is-equal flag, the is-zero flag, and other application specific flags. REG is used for general loading from memory, arithmetic operations, comparisons, and temporary storage. IDX is used for stack management, indirect memory access, and iteration. PC is used to track the current position in the system ROM, which determines the current instruction; in other systems this is referred to as an Instruction Pointer (IP).

Our instruction set can be summarized below:

Mnemonic  Arg.  PC  SR  REG  IDX  ALU  Bus  Cycles  Notes
read      Ar    .   .   W    .    i..  R    2       $REG := (Ar)
load      V     .   .   W    .    i..  .    1       $REG := V
send      Ar    .   .   R    .    i..  W    1       (Ar) := $REG
test      V     .   W   .    .    i..  .    1       $SR ⇐ $REG==V
jsrr      .     W   .   R    .    ...  .    1       $PC := $REG
goto      Ap    W   .   .    .    ...  .    1       $PC := Ap
brnz      Ap    W   R   .    .    i..  .    1       $PC := Ap if $REG!=0
brzz      Ap    W   R   .    .    i..  .    1       $PC := Ap if $REG==0
breq      Ap    W   R   .    .    i..  .    1       $PC := Ap if $SR.equal is true
brc0      Ap    W   R   .    .    i..  .    1       $PC := Ap if control 0 is true
brc1      Ap    W   R   .    .    i..  .    1       $PC := Ap if control 1 is true
addd      V     .   .   RW   .    i.F  .    1       $REG := $REG + V
subx      V     .   .   RW   .    i.F  .    1       $REG := $REG - V
spsh      V     .   .   .    RW   ii.  W    1       ($IDX++) := V (push V to stack)
rpsh      .     .   .   R    RW   ii.  W    1       ($IDX++) := $REG (push $REG to stack)
spop      .     .   .   W    RW   id.  R    2       $REG := (--$IDX) (pop value to $REG)
lspt      V     .   .   .    W    i..  .    1       $IDX := V (load stack pointer)
sidx      V     .   .   .    W    i..  .    1       $IDX := V (set index)
iidx      .     .   .   .    RW   ii.  .    1       $IDX := $IDX+1
lidx      .     .   .   W    R    i..  R    2       $REG := ($IDX)
swap      .     .   .   RW   RW   i..  .    1       swap($REG,$IDX)
trap      .     W   W   W    W    ...  .    1       Reset system
noop      .     .   .   .    .    i..  .    1       Do nothing

R(ead), W(rite), .(nothing), V(alue), Ar(Address/RAM), Ap(Address/Program), ALU (PC-IDX-REG) i(ncrement ALU), d(ecrement ALU), F(ull ALU)

In total these 22 opcodes were in the final VHDL version of the CPU that I was able to locate (technically 23, but lspt and sidx are identical). No opcode encoding work was done; that was left to the VHDL generation tools, but opcodes could easily be 8 bits to fit with the overall theme of the project.
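To make the notes column of the table a bit more concrete, here is a minimal sketch of how a handful of those opcodes could be interpreted in C. This is not the original VHDL: the opcode numbering, the struct and function names, and OP_HALT (a stop marker that the Mark I did not have) are all assumptions made for illustration, and only the is-equal flag of SR is modelled.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical opcode numbering: the original left encoding to the VHDL tools.
   OP_HALT is NOT a Mark I opcode; it only exists so this sketch can stop. */
enum op { OP_LOAD, OP_ADDD, OP_SUBX, OP_TEST, OP_SEND, OP_READ,
          OP_GOTO, OP_BRNZ, OP_BREQ, OP_HALT };

struct insn { enum op op; uint8_t arg; };

struct cpu {
    uint8_t pc, reg, idx;
    uint8_t sr_equal;       /* only the is-equal flag of SR is modelled */
    uint8_t ram[256];       /* 8 bits of addressable memory */
};

/* Execute one instruction; returns 0 when OP_HALT is reached, 1 otherwise. */
static int step(struct cpu *c, const struct insn *rom)
{
    assert(c && rom);
    struct insn i = rom[c->pc];
    uint8_t next = (uint8_t)(c->pc + 1);    /* the "i.." entry in the ALU column */

    switch (i.op) {
    case OP_LOAD: c->reg = i.arg; break;                        /* $REG := V */
    case OP_ADDD: c->reg = (uint8_t)(c->reg + i.arg); break;    /* $REG := $REG + V */
    case OP_SUBX: c->reg = (uint8_t)(c->reg - i.arg); break;    /* $REG := $REG - V */
    case OP_TEST: c->sr_equal = (c->reg == i.arg); break;       /* $SR <= $REG==V */
    case OP_SEND: c->ram[i.arg] = c->reg; break;                /* (Ar) := $REG */
    case OP_READ: c->reg = c->ram[i.arg]; break;                /* $REG := (Ar) */
    case OP_GOTO: next = i.arg; break;                          /* $PC := Ap */
    case OP_BRNZ: if (c->reg != 0) next = i.arg; break;
    case OP_BREQ: if (c->sr_equal) next = i.arg; break;
    case OP_HALT: return 0;
    }
    c->pc = next;
    return 1;
}

/* Run until OP_HALT or until max_steps instructions have executed.
   Returns the number of instructions executed (the halt is not counted). */
static unsigned run(struct cpu *c, const struct insn *rom, unsigned max_steps)
{
    unsigned n = 0;
    while (n < max_steps && step(c, rom))
        n++;
    return n;
}
```

Feeding run() a zeroed struct cpu and a small ROM array is enough to trace a loop by hand. The sketch treats every instruction as a single step, so the 2-cycle entries in the table and the manually inserted no-ops are not modelled.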

One challenge of working with this instruction set was that relative jumps are not in the instruction set, which means that a list of destination addresses needed to be maintained. Most code modifications were painful if they shifted code by even one address, so the Mark I CPU even has an assembler, but I’ll get to that a bit later.

Peripherals

You might notice that the CPU only interacts with registers, the 256 bytes of memory, and a few control lines with those branch instructions. So how does it work with the rest of the IO world?

Memory mapped peripherals.

Similar to embedded microcontrollers, if the CPU reads or writes a given 'magic' address it is actually reading or writing values in a peripheral that’s wired onto the main bus. Heck, even memory is a peripheral in our case. For most of the projects this CPU was used in we were required to use the SRAM chip on the DE1 board.
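As a rough illustration of that idea, the address decoding can be sketched in C as a switch on the address. The struct, the function names, and the behaviour of reads from write-only devices are assumptions of this sketch rather than anything recovered from the VHDL; the two addresses are borrowed from the example program later in this post.

```c
#include <assert.h>
#include <stdint.h>

/* Addresses borrowed from the example program; everything else is hypothetical. */
enum { ADDR_LEDS = 0x01, ADDR_UART_TX = 0x02 };

struct bus {
    uint8_t  sram[256];     /* backing memory for ordinary addresses */
    uint8_t  leds;          /* last value latched by the LED peripheral */
    uint8_t  uart_last;     /* last byte handed to the UART peripheral */
    unsigned uart_count;    /* number of bytes handed to the UART */
};

/* A write goes to whichever device decodes the address. */
static void bus_write(struct bus *b, uint8_t addr, uint8_t value)
{
    assert(b);
    switch (addr) {
    case ADDR_LEDS:    b->leds = value; break;
    case ADDR_UART_TX: b->uart_last = value; b->uart_count++; break;
    default:           b->sram[addr] = value; break;
    }
}

/* Reads of write-only peripherals return 0 in this sketch (an assumption). */
static uint8_t bus_read(const struct bus *b, uint8_t addr)
{
    assert(b);
    switch (addr) {
    case ADDR_LEDS:
    case ADDR_UART_TX: return 0;
    default:           return b->sram[addr];
    }
}
```

From the CPU's point of view nothing changes between devices: a send to 0x01 and a send to 0x40 are the same instruction, and only the decoder decides whether an LED latch or a RAM cell receives the byte.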

Since I’ve lost some of the code used with the Mark I based CPU projects I don’t have an exhaustive list, but I can find evidence of memory mapped peripherals for:

  • SRAM : General purpose memory

  • A debug interface : for manually stepping the CPU and visualizing the address/data lines

  • A timer : for accurately and regularly tending to tasks. This peripheral had to be polled since no interrupts exist in this CPU

  • A UART : for communicating to computers over a serial link

  • Beeper : for making sounds when a player has done an action in games

  • A vector graphics video driver with selectable VROM banks : For rendering graphics on X/Y mode oscilloscope outputs

  • A PS2 keyboard driver : for typing in commands

  • A character LCD driver : for displaying what users have typed

  • A seven segment display driver : typically for visualizing debug information

  • User input buttons : for binary user inputs or for stepping through the code at low speeds (It’s pretty hard to visualize things by eye at 50 MHz)

If I attempt to build another CPU in the future I expect a huge portion of the time will focus on the debug interface. I personally find them delightful. For a particularly good set of examples take a look at the Magic processor’s front panel or the currently available PDP-11 replica front panels by Obsolescence Guaranteed. In fact the user interface component is a delightful space to explore for complex systems whether that’s processors, synths, control systems, or some other complex system that needs to precisely communicate information to a skilled user.

Assembler & Examples

As mentioned before, the processor does not have relative jumps which makes program modifications tedious if we don’t have an assembler. The assembler itself is pretty trivial with most of the point being tracking the addresses of assembled instructions. In fact almost all of the meaningful bits of it can be seen in these yacc/lex snippets:

asm.y The primary grammar of the assembler
program:
       program statement '\n'
       |
       ;
statement:
         express
         | label express
         | label
         | directive
         |
         ;
directive:
         EQU SYMBOL NUMBER {$2->val=$3;}
         | START {pass++;pc=0;}
         | SRC SYMBOL {if(pass>1) printf("--%s\n",$2->name);}
         ;

express:
       OPCODE          {line($1, 0x00);}
       | OPCODE SYMBOL {line($1, $2->val);}
       | OPCODE NUMBER {line($1, $2);}
       ;
label:
     SYMBOL ':'        {$1->val=pc;}
     ;

asm.lex The tokenization of the assembler
;.*         /*Ignore comments*/
[ ]+            /*Ignore whitespace*/
iidx|read|test|swap|spop yylval.string=strdup(yytext); return OPCODE;
load|send|nopp|goto|subx yylval.string=strdup(yytext); return OPCODE;
rpsh|rpop|spsh|trap          yylval.string=strdup(yytext); return OPCODE;
addd|brc[0-2]|jsrr|breq|sidx yylval.string=strdup(yytext); return OPCODE;
brnz|brzz|lidx yylval.string=strdup(yytext); return OPCODE;
#[0-9a-fA-F]+   sscanf(yytext+1,"%X",&yylval); return NUMBER;
%[0-9a-zA-Z]    yylval.number = (char) yytext[1];    return NUMBER;
[_a-zA-Z]+      yylval.sym=intern(yytext);    return SYMBOL;
^\.start        return START;
^\.equ          return EQU;
^\.src          return SRC;
:               return *yytext;
\n              return *yytext;

A program consists of a sequence of statements. Those statements are either labels for code to jump to, opcodes with optional arguments, definitions of constants, or definitions of functions which are printed as comments in the generated code. Easy enough for something as low level as an assembler, right?
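The address tracking is the interesting part, since a goto can name a label that is only defined further down. One plausible reading of the pass counter in the grammar is a two-pass scheme, sketched below in C. The statement struct, the fixed-size symbol table, and this intern() are stand-ins I made up for the sketch (the original intern() is not shown above), and only operand bytes are emitted since no opcode encoding exists.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* One source statement: optional label, optional opcode, optional symbolic operand. */
struct stmt { const char *label; const char *opcode; const char *operand; };

struct symbol { const char *name; uint8_t val; };

#define MAX_SYMS 32

struct symtab { struct symbol syms[MAX_SYMS]; size_t count; };

/* Find a symbol by name, creating it with value 0 if it is new. */
static struct symbol *intern(struct symtab *t, const char *name)
{
    for (size_t i = 0; i < t->count; i++)
        if (strcmp(t->syms[i].name, name) == 0)
            return &t->syms[i];
    assert(t->count < MAX_SYMS);
    t->syms[t->count].name = name;
    t->syms[t->count].val = 0;
    return &t->syms[t->count++];
}

/* Writes one resolved operand byte per instruction into out[] and returns
   the instruction count. Pass 1 only records label addresses; pass 2 emits. */
static size_t assemble(const struct stmt *src, size_t n, uint8_t *out)
{
    struct symtab tab = { .count = 0 };
    uint8_t pc = 0;

    for (int pass = 1; pass <= 2; pass++) {
        pc = 0;                                 /* restart addresses each pass */
        for (size_t i = 0; i < n; i++) {
            if (src[i].label)
                intern(&tab, src[i].label)->val = pc;
            if (!src[i].opcode)
                continue;                       /* label-only line emits nothing */
            if (pass == 2)
                out[pc] = src[i].operand
                        ? intern(&tab, src[i].operand)->val
                        : 0x00;
            pc++;
        }
    }
    return pc;
}
```

Because every label already has its final address by the time the second pass starts, inserting an instruction in the middle of a program only requires re-running the assembler instead of renumbering jump targets by hand.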

So, what does a real program look like? Well, let’s look at a slightly modified assembly project. For context, this comes from the project where a collection of analog values is read from an external chip and then transmitted over a serial connection to be displayed on a PC.

To start off, let’s look at some constants:

;Define constants
.equ        UART_TX     #02
.equ        LEDS        #01
.equ        ADC         #06
.equ        STACKP      #21
.equ        STACK_H     #F0
.equ        BINTOASCII  #08
.equ        ASCII_H     #09
.equ        ASCII_L     #08
.equ        TMP_SEND    #A2
;brc0 is uart ready send

In this program the CPU has a few system peripherals:

  1. A set of on board LEDs to display running values of our input memory mapped at 0x01 (write only)

  2. The serial link to the computer (UART) memory mapped to address 0x02 (write only)

  3. The analog to digital chip mapped at 0x06 (read only)

  4. A binary to ascii-hex conversion ROM mapped at 0x08 (input) and 0x08..0x09 (output)

In addition, the stack that stores data to be sent to the PC spans 0x21..0xF0, which should correspond to 103 8-bit values (two hex characters each) and a null terminator.

Finally, a comment indicates that the UART is connected to the CPU’s control line 0 to signal when data can be sent.

;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
.src        INIT_ROUTINE
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
init:
            sidx STACKP     ;config stack
            spsh #4D        ;'M'

;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
.src        MAIN_ROUTINE
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
main:
            goto inputs
in_ret:     goto outputs
out_ret:    goto delay
del_ret:    goto main

The beginning of the program initializes the array/stack which is used to store data which will be transmitted and then it establishes the main program flow. Inputs are collected, the input is output to the computer, the system waits, and then it repeats the process.

;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
.src        INPUT_ROUTINE
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
inputs:
            read ADC        ;Get analog value
            nopp

            send BINTOASCII ;Convert to ascii
            send LEDS
            read ASCII_H
            nopp
            rpsh            ;Push values onto stack (char*)
            nopp
            read ASCII_L
            nopp
            rpsh

            swap
            test STACK_H    ;Check for end of loop
            swap
            breq send_data
send_ret:   goto in_ret

Ignore the no-operation instructions here; as the comments indicate, data is read from the ADC peripheral. The data is converted into two bytes of hex-ascii (e.g. 0xfa becomes one character per hex digit), then stored on the stack. At the end we check to see if the stack pointer is at its maximal value with the test instruction. If it is at that value we go to the next stage, otherwise we keep gathering more input.
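As a sanity check on the buffer size, the stack pointer arithmetic can be replayed in a few lines of C. This is only a sketch of the bookkeeping: the function name is made up, and it assumes exactly what the listings show, namely one byte pushed by init and two bytes pushed per sample before each comparison against STACK_H.

```c
#include <assert.h>
#include <stdint.h>

/* Number of ADC samples buffered before IDX equals stack_high. */
static unsigned samples_until_flush(uint8_t stack_start, uint8_t stack_high)
{
    assert(stack_start < stack_high);
    uint8_t idx = stack_start;
    unsigned samples = 0;

    idx++;                              /* init: spsh #4D pushes the 'M' marker */
    do {
        idx = (uint8_t)(idx + 2);       /* two rpsh per sample: ASCII_H, ASCII_L */
        samples++;
        assert(samples <= 256);         /* guard: a mismatched parity never hits stack_high */
    } while (idx != stack_high);        /* swap; test STACK_H; swap; breq send_data */

    return samples;
}
```

With STACKP at 0x21 and STACK_H at 0xF0 this gives 103 samples per transmitted line. The equality test also means the start and end constants have to agree on parity, otherwise the index would step right past STACK_H.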

;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
.src        SEND_ROUTINE
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
send_data:
            spsh #0A        ;Add new line
            spsh #00        ;Add null terminator
            sidx STACKP     ;Set start of char*
send_loop:
            lidx            ;Read in next value
            nopp
            brzz send_exit  ;Check for null terminator
            send TMP_SEND
wait:       brc0 do_send
            ;do something useful while waiting
            read ADC
            nopp
            send LEDS
            goto wait

do_send:    read TMP_SEND
            nopp
            send UART_TX    ;Transmit data

            iidx
            goto send_loop

send_exit:
            sidx STACKP      ;reset stack
            spsh #4D         ;'M'
            goto in_ret

;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
.src        OUTPUT_ROUTINE
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
outputs:    goto out_ret

Now with output we’re getting to the point where we have something more complex. In the first 3 instructions we establish that we have a null terminated string starting at the constant address STACKP. We now want to send this string byte by byte to the UART, which is done by send_loop. Before each character is sent we have to wait for the UART to be ready for the next byte, so we stay in the wait loop until control line 0 indicates we can send the next character. When we do send the data it is moved from the TMP_SEND address to the UART_TX memory mapped device and we repeat the process until the full string has been sent.

Rewritten in C this looks like:

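#include <stdint.h>

/* Assumed for this sketch: STACKP marks the start of the transmit buffer
   and uart_busy() polls control line 0. */
extern uint8_t STACKP[];
extern int uart_busy(void);

uint8_t *stack_pos;
volatile uint8_t *led;
volatile uint8_t *adc;
volatile uint8_t *uart;

void send_data(void)
{
    uint8_t chr;
    (*stack_pos++) = '\n';      /* add new line */
    (*stack_pos++) = '\0';      /* add null terminator */
    stack_pos = STACKP;         /* rewind to the start of the string */

    while(1) {
        chr = *stack_pos++;
        if(chr == 0)
            break;
        while(uart_busy())
            *led = *adc;        /* do something useful while waiting */
        *uart = chr;            /* transmit data */
    }
    stack_pos = STACKP;         /* reset stack */
    (*stack_pos++) = 'M';
}

;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;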
.src        DELAY_ROUTINE
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
delay:      load #00
delay_loop: addd #FF
            nopp
            nopp
            brnz delay_loop
            goto del_ret

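The delay routine is the last one. It counts through 256 values (more or less: starting from 0, adding 0xFF to an 8-bit register steps it down by one each pass until it wraps back around to 0) and we can calculate how long this should take to execute. The processor runs at 50MHz with one instruction per cycle, and one loop is 4 instructions (add+2x no-op+branch). So that is 256*4+1+1 = 1026 cycles (the extra two being the initial load and final goto), or roughly 20 micro-seconds at 20ns per cycle. Not much time at all.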
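The same back-of-the-envelope calculation can be written out in C. The helper names are made up for this sketch, and it leans on the assumptions stated above: one instruction per cycle, a 50MHz clock, and 8-bit wraparound in the register.

```c
#include <assert.h>
#include <stdint.h>

/* Count passes of `addd #FF ... brnz` starting from REG = 0 with 8-bit wraparound. */
static uint32_t loop_iterations(void)
{
    uint8_t reg = 0;
    uint32_t n = 0;
    do {
        reg = (uint8_t)(reg + 0xFF);    /* same as subtracting 1 modulo 256 */
        n++;
    } while (reg != 0);                 /* brnz delay_loop */
    return n;
}

/* Cycles for: one load, `iterations` passes of the loop body, one final goto. */
static uint32_t delay_cycles(uint32_t iterations, uint32_t insns_per_loop)
{
    assert(insns_per_loop > 0);
    return iterations * insns_per_loop + 2;
}

/* Convert a cycle count to nanoseconds for a clock given in MHz. */
static uint32_t cycles_to_ns(uint32_t cycles, uint32_t clock_mhz)
{
    assert(clock_mhz > 0);
    return cycles * 1000u / clock_mhz;
}
```

Plugging in the numbers, loop_iterations() comes out to 256, delay_cycles(256, 4) to 1026, and cycles_to_ns(1026, 50) to 20520, i.e. about 20.5 micro-seconds.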

Quirks

The processor had its fair share of quirks and general limitations. While this is by no means a comprehensive list a few that were pain points at the time:

  • It ran at 50MHz with all actions on the rising edge of the clock and nothing on the falling edge. The speed was higher than was needed and the lack of falling edge operations resulted in several quirks around reads and writes bleeding into a 2 cycle instruction

  • No secondary register, so manually allocating the single register to various tasks was fiddly

  • No function call/return opcodes, so subroutine calls are verbose operations where the user had to save their variables before placing a return address on the stack and then making sure the called code ends with popping a value from the stack and unconditionally jumping to the current register value

  • Read is a multi-tick operation and multi-tick instructions do not exist, so no op instructions were manually inserted

  • Occasional bugs where on some revisions jumps would occur before the program counter was incremented.

  • 256 instructions really isn’t a lot. For basic problems sure, that’s loads and when you can write arbitrary peripherals it’s workable, but I did run out of program space more than once in the development of the final project

Looking back

It’s pretty amazing looking back at the specs for the DE1 board. Those boards are still sold, but as an updated revision, and the newer revision is a heck of a lot more powerful than the original. Reading the spec sheet of the individual FPGA on my board I’ve got 18,752 logic elements, and the new FPGA simply specifies 77k logic elements, which gives me the sense that students aren’t running into those situations where they’ve exhausted every single logic unit on one of these boards while doing classwork. The newer boards don’t even need the effort of building a custom CPU since they’ve got a built in ARM core. Heck, the ARM core has access to a full GB of RAM, which is remarkable compared to the 8MB of SDRAM on my board and the 256 bytes of address space of the Mark I CPU.

The Mark I CPU worked. It did its job, and while I haven’t accomplished my personal goal of making a CPU from raw transistors it got me a decent part of the way there. If I do go all the way to transistors I’m guessing I’ll go down to 4 bits or have the PCB fab assemble the ocean of transistors. Overall it’s a fun space and simple instruction sets are really interesting. I don’t play around with them as much as I used to, but it is a delight to read through ISAs like the Z80 or AVR. There’s a beauty to them that is lost in the pragmatic full instruction sets you see in x86 and ARM.

So, I’ll leave this article here for now. I may well get back to it and edit it some more or extend it, but this has been sitting in my drafts folder long enough that I should put it up on my website before it collects another layer of dust.