hapenny
is a 32-bit RISC-V CPU implementation that operates internally on
16-bit chunks. This means it takes longer to do things, but uses less space.
This approach was inspired by the MC68000 (1979), which also implemented a
32-bit instruction set using a 16-bit datapath. (hapenny
uses about half as
many cycles per instruction as the MC68000, after optimization.)
hapenny
was written to evaluate the Amaranth HDL.
(The current hapenny
was formerly version 2; once it became mature enough I
removed version 1.)
- Over 12M inst/sec on iCE40 HX1K, while occupying under 800 LCs, or less than 63% of the chip. (Throughput compares favorably to some 32-bit implementations occupying twice the area.)
- Native 16-bit bus allows for simpler peripherals and external RAMs. (Can run out of external 16-bit SRAM with no penalty.)
- Parameterized with knobs for trading off size vs capability.
- Implements the RV32I unprivileged instruction set (currently missing FENCE and SYSTEM).
- Optional interrupt support in the older core. (yet to come in the revised one)
- Written in Python using Amaranth.
There are a bazillion open-source RISC-V CPU implementations out there, which is what happens when you release a well-designed and free-to-implement instruction set spec -- nerds like me will crank out implementations.
I wrote hapenny
as an experiment to see if I could target the space between
the PicoRV32 core and the SERV core, in terms of size and performance. I
specifically wanted to produce a CPU with decent performance that could fit into
an iCE40 HX1K part (like on the Icestick evaluation board) with enough space
left over for useful logic. PicoRV32 doesn't quite fit on that chip; SERV fits
but takes 32-64 cycles per instruction.
Property | PicoRV32-small | hapenny |
SERV |
---|---|---|---|
Datapath width (bits) | 32 | 16 | 1 |
External data bus width | 32 | 16 | 32 |
Average cycles per instruction | 5.429 | 5.525 | 40-ish |
Minimal size on iCE40 (LCs) | 1500-ish | 796 | 200-ish |
Typical MHz on iCE40 | 40s? | 72+ | 40s? |
(Cycles/instruction is measured on Dhrystone. Minimal size is the output
produced by the icestick-smallest.py
script. I would appreciate help getting
apples-to-apples comparison numbers!)
So, basically,
-
hapenny
is significantly smaller than a similarly-configured PicoRV32 core for only 1.7% less performance per clock. (Of course, PicoRV32 is a far more general and well-tested processor, and in practice you'd configure it with performance-enhancing features like a dual-port register file and faster shifts.) -
hapenny
is much faster than SERV, but also about 4x larger. (SERV is also better tested thanhapenny
.)
hapenny
is easy to interface to 16-bit peripherals and external memory with no
(additional) performance loss. This can result in smaller overall designs and
simpler boards. For instance, hapenny
can run at full rate out of the 16-bit
SRAM on the Icoboard.
Independent from the datapath width, I also did some fairly aggressive manual
register retiming in the decoder and datapath, which means hapenny
can often
close timing at higher Fmax than other simple RV32 cores. (I miss automatic
retiming from ASIC toolchains.)
hapenny
executes (most of) the RV32I instruction set in 16-bit pieces. It uses
16-bit memory, a 16-bit (single-ported) register file, and a 16-bit ALU. To
perform 32-bit operations, it uses the same techniques a programmer might use in
software on a 16-bit computer, e.g. "chaining" operations using preserved
carry/zero bits.
All memory interfaces in hapenny
are synchronous, including the register file,
which is another reason why operations take more cycles. The RV32I register file
is comparatively large (at 1024 bits), and using a synchronous register file
ensures that it can be mapped into an FPGA block RAM if desired.
Here's what the CPU does during the timing of a typical instruction like ADD
.
I've color/brightness-coded three different executions that are in flight during
this diagram.
- The "FD-Box" is responsible for fetch and decode, and is always working on the
next instruction. It requires three cycles to fetch both halfwords of an
instruction, and then uses the
DECODE
cycle to do initial instruction decoding and start the read of rs1's low half. (It spends one cycle out of four essentially idle to make the state machines line up conveniently.) - The "EW-Box" is responsible for execute and writeback. It goes through at
least four states in every instruction:
R2L
starts the load of the low half of rs2 from the register file.OPL
operates on the low halves of rs1 and rs2 (or rs1 and an immediate), and also starts the load of the high half of rs1.R2H
andOPH
do the same thing for the high half.
Most instructions take four cycles, as shown in that diagram. Some take more if
they need to do additional things (by adding states), or if they change control
flow such that the FD-Box's speculative fetch was wrong. The CPU test bench
(sim-cpu.py
) measures the cycle timing for every instruction; here's where
things currently stand:
Instruction | Cycles | Notes |
---|---|---|
AUIPC | 4 | |
LUI | 4 | |
JAL | 8 | Includes four-cycle re-fetch penalty |
JALR | 8 | Includes four-cycle re-fetch penalty |
Branch | 5/10 | Not Taken / Taken |
Load | 6 | |
SW | 5 | |
SB/SH | 4 | |
SLT(I)(U) | 6 | |
Shift | 6 + N | N is number of bits shifted |
Other ALU op | 4 |
On the instruction mix in Dhrystone, this yields an average of 5.525 cycles/instruction.
hapenny
uses a very simple bus interface with up to 32-bit addressing. In
practice, applications will wire up fewer than 32 address lines, which saves
space.
Signal | Driver | Width | Description |
---|---|---|---|
addr |
CPU | up to 31 | addresses a halfword, i.e. LSB missing |
data_out |
CPU | 16 | carries data for a write |
lanes |
CPU | 2 | signals a write of either or both byte in a halfword; zero means a load |
valid |
CPU | 1 | when high, indicates that the signals above are valid and starts a bus transaction. |
response |
device | 16 | on the cycle after a load, carries back data from the addressed device. |
The PC can be shrunk separately from the address bus if you know that all program memory appears in e.g. the bottom half of the address space. This further saves space.
The bus interface does not support wait states, to reduce complexity. This makes
it difficult to interface to things like XIP SPI Flash or SDRAM. hapenny
is
really intended for applications that don't rely on such things.
hapenny
exposes a fairly flexible debug interface capable of inspecting
processor state and reading and writing the register file. These feautres are
only available when the processor is halted, which can be achieved by holding
halt_request
high until the processor confirms (at the next instruction
boundary) by asserting halted
. Release halt_request
to resume.
Finally, hapenny
has an RVFI (RISC-V Formal Interface) trace port for
generating a trace of instruction effects, though I haven't wired up the actual
test suite.
Currently, hapenny
does not support interrupts, but I'm planning on changing
this. (An earlier version did, support was removed when I rearchitected the core
for v2.)
-
Written by someone who pretends to be an electrical engineer as a way to procrastinate finishing his slides for a talk.
-
Used for exactly one thing so far, so not exactly battle-hardened.
-
Less general than more mature implementations like PicoRV32 -- e.g. no support for wait states, hardware multiply, coprocessors, or (currently) interrupts.
-
16-bit external data bus means that, currently, 32-bit reads/writes are not atomic -- a problem when interfacing with peripherals with 32-bit memory-mapped registers. (Peripherals with 16-bit memory-mapped registers work fine, however.)
-
Not exactly well factored/commented.
-
Written in Python, so chances are pretty good the code won't keep working across OS updates / minor runtime versions.
hapenny
is implemented using about half the logic of other cheap RV32 cores.
The half-penny, or "ha'penny," is a historical English coin worth (as the name implies) half a penny. So if the other cheap cores cost a penny, this is a ha'penny.