Small 32-bit RISC-V CPU with a half-width datapath inspired by the 68000

cbiffle/hapenny

hapenny: a half-width RISC-V

hapenny is a 32-bit RISC-V CPU implementation that operates internally on 16-bit chunks. This means it takes longer to do things, but uses less space.

This approach was inspired by the MC68000 (1979), which also implemented a 32-bit instruction set using a 16-bit datapath. (hapenny uses about half as many cycles per instruction as the MC68000, after optimization.)

hapenny was written to evaluate the Amaranth HDL.

(The current hapenny was formerly version 2; once it became mature enough I removed version 1.)

Bullet points

  • Over 12M inst/sec on iCE40 HX1K, while occupying under 800 LCs, or less than 63% of the chip. (Throughput compares favorably to some 32-bit implementations occupying twice the area.)
  • Native 16-bit bus allows for simpler peripherals and external RAMs. (Can run out of external 16-bit SRAM with no penalty.)
  • Parameterized with knobs for trading off size vs capability.
  • Implements the RV32I unprivileged instruction set (currently missing FENCE and SYSTEM).
  • Interrupt support was available in the older core and has yet to come to the revised one.
  • Written in Python using Amaranth.

But why

There are a bazillion open-source RISC-V CPU implementations out there, which is what happens when you release a well-designed and free-to-implement instruction set spec -- nerds like me will crank out implementations.

I wrote hapenny as an experiment to see if I could target the space between the PicoRV32 core and the SERV core, in terms of size and performance. I specifically wanted to produce a CPU with decent performance that could fit into an iCE40 HX1K part (like on the Icestick evaluation board) with enough space left over for useful logic. PicoRV32 doesn't quite fit on that chip; SERV fits but takes 32-64 cycles per instruction.

| Property | PicoRV32-small | hapenny | SERV |
|---|---|---|---|
| Datapath width (bits) | 32 | 16 | 1 |
| External data bus width (bits) | 32 | 16 | 32 |
| Average cycles per instruction | 5.429 | 5.525 | 40-ish |
| Minimal size on iCE40 (LCs) | 1500-ish | 796 | 200-ish |
| Typical MHz on iCE40 | 40s? | 72+ | 40s? |

(Cycles/instruction is measured on Dhrystone. Minimal size is the output produced by the icestick-smallest.py script. I would appreciate help getting apples-to-apples comparison numbers!)

So, basically,

  • hapenny is significantly smaller than a similarly-configured PicoRV32 core for only 1.7% less performance per clock. (Of course, PicoRV32 is a far more general and well-tested processor, and in practice you'd configure it with performance-enhancing features like a dual-port register file and faster shifts.)

  • hapenny is much faster than SERV, but also about 4x larger. (SERV is also better tested than hapenny.)
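The figures in the bullets above can be rechecked from the comparison table with a few lines of arithmetic. This is only a sketch using the numbers quoted in this README: 72 MHz is the lower bound from the "72+" entry, 200 LCs is the table's "200-ish" estimate for SERV, and the variable names are made up for illustration.

```python
# Numbers copied from the comparison table above.
CPI_PICORV32_SMALL = 5.429
CPI_HAPENNY = 5.525
HAPENNY_MHZ = 72      # "72+" in the table, so treat this as a lower bound
HAPENNY_LCS = 796
SERV_LCS = 200        # "200-ish" in the table, a rough figure

# Performance per clock is the reciprocal of cycles per instruction,
# so the relative loss is 1 - (CPI_pico / CPI_hapenny).
per_clock_loss = 1 - CPI_PICORV32_SMALL / CPI_HAPENNY

# Millions of instructions per second at the quoted clock rate.
minst_per_sec = HAPENNY_MHZ / CPI_HAPENNY

# Area relative to SERV.
area_ratio = HAPENNY_LCS / SERV_LCS

print(f"per-clock loss vs PicoRV32-small: {per_clock_loss:.1%}")
print(f"throughput at {HAPENNY_MHZ} MHz: {minst_per_sec:.1f} M inst/sec")
print(f"size relative to SERV: {area_ratio:.1f}x")
```

The results line up with the "1.7%", "over 12M inst/sec", and "about 4x" claims made elsewhere in this document.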

hapenny is easy to interface to 16-bit peripherals and external memory with no (additional) performance loss. This can result in smaller overall designs and simpler boards. For instance, hapenny can run at full rate out of the 16-bit SRAM on the Icoboard.

Independently of the datapath width, I also did some fairly aggressive manual register retiming in the decoder and datapath, which means hapenny can often close timing at higher Fmax than other simple RV32 cores. (I miss automatic retiming from ASIC toolchains.)

Details

hapenny executes (most of) the RV32I instruction set in 16-bit pieces. It uses 16-bit memory, a 16-bit (single-ported) register file, and a 16-bit ALU. To perform 32-bit operations, it uses the same techniques a programmer might use in software on a 16-bit computer, e.g. "chaining" operations using preserved carry/zero bits.
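As a software analogy for the "chaining" described above, here is a minimal sketch of a 32-bit add built from two 16-bit adds and a preserved carry bit. It is not taken from hapenny's source; the function name and structure are hypothetical and only illustrate the general technique.

```python
MASK16 = 0xFFFF


def add32_in_halves(a: int, b: int) -> int:
    """Add two 32-bit values using only 16-bit additions and a carry bit."""
    # Low halves first; bit 16 of the intermediate sum is the carry out.
    low = (a & MASK16) + (b & MASK16)
    carry = low >> 16

    # High halves next, with the preserved carry chained in.
    high = ((a >> 16) & MASK16) + ((b >> 16) & MASK16) + carry

    # Reassemble the result; the final carry out is discarded,
    # matching the wrap-around behavior of RV32I ADD.
    return ((high & MASK16) << 16) | (low & MASK16)


print(hex(add32_in_halves(0x0000FFFF, 0x00000001)))
print(hex(add32_in_halves(0xFFFFFFFF, 0x00000001)))
```

The first call shows a carry propagating from the low half into the high half; the second shows the 32-bit result wrapping to zero.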

All memory interfaces in hapenny are synchronous, including the register file, which is another reason why operations take more cycles. The RV32I register file is comparatively large (at 1024 bits), and using a synchronous register file ensures that it can be mapped into an FPGA block RAM if desired.

Here's what the CPU does over the course of a typical instruction like ADD. I've color/brightness-coded three different executions that are in flight during this diagram.

[Figure: a timing diagram showing a typical instruction cycle.]

  • The "FD-Box" is responsible for fetch and decode, and is always working on the next instruction. It requires three cycles to fetch both halfwords of an instruction, and then uses the DECODE cycle to do initial instruction decoding and start the read of rs1's low half. (It spends one cycle out of four essentially idle to make the state machines line up conveniently.)
  • The "EW-Box" is responsible for execute and writeback. It goes through at least four states in every instruction:
    • R2L starts the load of the low half of rs2 from the register file.
    • OPL operates on the low halves of rs1 and rs2 (or rs1 and an immediate), and also starts the load of the high half of rs1.
    • R2H and OPH do the same thing for the high half.

Most instructions take four cycles, as shown in that diagram. Some take more if they need to do additional things (by adding states), or if they change control flow such that the FD-Box's speculative fetch was wrong. The CPU test bench (sim-cpu.py) measures the cycle timing for every instruction; here's where things currently stand:

| Instruction | Cycles | Notes |
|---|---|---|
| AUIPC | 4 | |
| LUI | 4 | |
| JAL | 8 | Includes four-cycle re-fetch penalty |
| JALR | 8 | Includes four-cycle re-fetch penalty |
| Branch | 5/10 | Not Taken / Taken |
| Load | 6 | |
| SW | 5 | |
| SB/SH | 4 | |
| SLT(I)(U) | 6 | |
| Shift | 6 + N | N is number of bits shifted |
| Other ALU op | 4 | |

On the instruction mix in Dhrystone, this yields an average of 5.525 cycles/instruction.
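To show how a table like this turns into an average, here is a small sketch that encodes the cycle counts above and computes a weighted average over an instruction mix. The mix used here is invented purely for illustration; it is not the Dhrystone mix (which this README does not list) and will not reproduce the 5.525 figure. All names are hypothetical.

```python
# Fixed cycle counts copied from the table above.
FIXED_CYCLES = {
    "AUIPC": 4, "LUI": 4, "JAL": 8, "JALR": 8,
    "LOAD": 6, "SW": 5, "SB": 4, "SH": 4, "SLT": 6, "ALU": 4,
}


def cycles(kind, *, taken=False, shift_amount=0):
    """Return the cycle count for one instruction of the given kind."""
    if kind == "BRANCH":
        return 10 if taken else 5      # taken / not taken
    if kind == "SHIFT":
        return 6 + shift_amount        # 6 + N, N = bits shifted
    return FIXED_CYCLES[kind]


def average_cpi(mix):
    """mix is a list of (instruction_count, cycles_each) pairs."""
    total_instructions = sum(count for count, _ in mix)
    total_cycles = sum(count * each for count, each in mix)
    return total_cycles / total_instructions


# ASSUMPTION: a made-up mix of 100 instructions, not measured from any program.
hypothetical_mix = [
    (50, cycles("ALU")),
    (20, cycles("LOAD")),
    (10, cycles("SW")),
    (10, cycles("BRANCH", taken=True)),
    (10, cycles("BRANCH", taken=False)),
]

print(average_cpi(hypothetical_mix))
```

Swapping in a measured instruction mix (for example, counts gathered from a trace) would give the real average for that workload.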

Interfaces

hapenny uses a very simple bus interface with up to 32-bit addressing. In practice, applications will wire up fewer than 32 address lines, which saves space.

| Signal | Driver | Width (bits) | Description |
|---|---|---|---|
| addr | CPU | up to 31 | addresses a halfword, i.e. LSB missing |
| data_out | CPU | 16 | carries data for a write |
| lanes | CPU | 2 | signals a write of either or both bytes in a halfword; zero means a load |
| valid | CPU | 1 | when high, indicates that the signals above are valid and starts a bus transaction |
| response | device | 16 | on the cycle after a load, carries back data from the addressed device |

The PC can be shrunk separately from the address bus if you know that all program memory appears in e.g. the bottom half of the address space. This further saves space.

The bus interface does not support wait states, to reduce complexity. This makes it difficult to interface to things like XIP SPI Flash or SDRAM. hapenny is really intended for applications that don't rely on such things.
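The bus protocol described above can be modeled in a few lines of plain Python. This is a behavioral sketch, not Amaranth code and not taken from the hapenny sources. Two things here are assumptions the README does not state: that bit 0 of lanes selects the low byte and bit 1 the high byte, and that a 32-bit store issues the low halfword first. The class and helper names are hypothetical.

```python
class HalfwordRam:
    """Behavioral model of a 16-bit RAM attached to the bus."""

    def __init__(self, size_halfwords):
        self.mem = [0] * size_halfwords
        self.response = 0  # value the device drives back to the CPU

    def tick(self, addr=0, data_out=0, lanes=0b00, valid=False):
        """Advance one clock with the CPU-driven signals for this cycle.

        Returns `response` as the CPU would see it on the following cycle.
        There are no wait states: every valid transaction completes at once.
        """
        if valid:
            if lanes == 0b00:
                # lanes == 0 means a load.
                self.response = self.mem[addr]
            else:
                word = self.mem[addr]
                if lanes & 0b01:  # ASSUMPTION: bit 0 = low byte
                    word = (word & 0xFF00) | (data_out & 0x00FF)
                if lanes & 0b10:  # ASSUMPTION: bit 1 = high byte
                    word = (word & 0x00FF) | (data_out & 0xFF00)
                self.mem[addr] = word
        return self.response


def store_word(ram, byte_addr, value):
    """A 32-bit store becomes two halfword transactions."""
    half_addr = byte_addr >> 1  # addr is a halfword address (LSB missing)
    # ASSUMPTION: low halfword is written first.
    ram.tick(addr=half_addr, data_out=value & 0xFFFF, lanes=0b11, valid=True)
    ram.tick(addr=half_addr + 1, data_out=(value >> 16) & 0xFFFF,
             lanes=0b11, valid=True)


ram = HalfwordRam(16)
store_word(ram, 8, 0xDEADBEEF)                              # word store
ram.tick(addr=4, data_out=0x0012, lanes=0b01, valid=True)   # byte store
loaded = ram.tick(addr=5, valid=True)                       # halfword load
print(hex(ram.mem[4]), hex(ram.mem[5]), hex(loaded))
```

Because the model has no way to stall the CPU, it also shows why devices that need variable latency do not fit this interface directly.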

hapenny exposes a fairly flexible debug interface capable of inspecting processor state and reading and writing the register file. These features are only available when the processor is halted, which can be achieved by holding halt_request high until the processor confirms (at the next instruction boundary) by asserting halted. Release halt_request to resume.
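The halt handshake described above can be sketched as a toy cycle-by-cycle model. Only the signal names halt_request and halted come from this README; the class, the fixed four-cycle instruction length, and the exact cycle on which halted changes are simplifying assumptions, not a description of the real core.

```python
class HaltHandshakeModel:
    """Toy model: the CPU only honors halt_request at an instruction boundary."""

    def __init__(self, cycles_per_instruction=4):
        # ASSUMPTION: every instruction takes the same number of cycles.
        self.cycles_per_instruction = cycles_per_instruction
        self.phase = 0        # position within the current instruction
        self.halted = False

    def tick(self, halt_request):
        """Advance one clock and return the value of `halted`."""
        if self.halted:
            # While halted, stay put until the debugger releases the request.
            if not halt_request:
                self.halted = False
            return self.halted

        self.phase = (self.phase + 1) % self.cycles_per_instruction
        if self.phase == 0 and halt_request:
            # Instruction boundary reached with a request pending.
            self.halted = True
        return self.halted


cpu = HaltHandshakeModel()
trace = [cpu.tick(halt_request=True) for _ in range(5)]
resumed = cpu.tick(halt_request=False)
print(trace, resumed)
```

A debugger following this pattern would keep halt_request asserted, poll halted, do its register reads and writes, and only then drop halt_request.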

Finally, hapenny has an RVFI (RISC-V Formal Interface) trace port for generating a trace of instruction effects, though I haven't wired up the actual test suite.

Interrupt options

Currently, hapenny does not support interrupts, but I'm planning on changing this. (An earlier version did; support was removed when I rearchitected the core for v2.)

Drawbacks

  • Written by someone who pretends to be an electrical engineer as a way to procrastinate finishing his slides for a talk.

  • Used for exactly one thing so far, so not exactly battle-hardened.

  • Less general than more mature implementations like PicoRV32 -- e.g. no support for wait states, hardware multiply, coprocessors, or (currently) interrupts.

  • 16-bit external data bus means that, currently, 32-bit reads/writes are not atomic -- a problem when interfacing with peripherals with 32-bit memory-mapped registers. (Peripherals with 16-bit memory-mapped registers work fine, however.)

  • Not exactly well factored/commented.

  • Written in Python, so chances are pretty good the code won't keep working across OS updates / minor runtime versions.
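The non-atomic 32-bit access drawback listed above is easiest to see with a peripheral whose value changes between the two halfword reads. The sketch below uses a hypothetical free-running 32-bit counter; the class, the low-half-first read order, and the re-read workaround (a common software technique, not something this README prescribes) are all assumptions for illustration.

```python
class FreeRunningCounter:
    """Hypothetical peripheral: a 32-bit counter read as two 16-bit halves."""

    def __init__(self, start):
        self.value = start & 0xFFFFFFFF

    def read_half(self, index):
        # index 0 = low halfword, index 1 = high halfword
        return (self.value >> (16 * index)) & 0xFFFF

    def tick(self):
        self.value = (self.value + 1) & 0xFFFFFFFF


def torn_read(counter):
    """Two separate bus reads with the counter advancing in between."""
    low = counter.read_half(0)    # ASSUMPTION: low half is read first
    counter.tick()                # peripheral changes between transactions
    high = counter.read_half(1)
    return (high << 16) | low


def stable_read(counter):
    """Re-read the high half until it is unchanged around the low read."""
    while True:
        high = counter.read_half(1)
        low = counter.read_half(0)
        counter.tick()            # time passes before the check
        if counter.read_half(1) == high:
            return (high << 16) | low


torn = torn_read(FreeRunningCounter(0x0000FFFF))
stable = stable_read(FreeRunningCounter(0x0000FFFF))
print(hex(torn), hex(stable))
```

The torn result combines halves from two different moments and is a value the counter never held, while the re-read loop returns a value the counter did hold at some point.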

What's with the name

hapenny is implemented using about half the logic of other cheap RV32 cores.

The half-penny, or "ha'penny," is a historical English coin worth (as the name implies) half a penny. So if the other cheap cores cost a penny, this is a ha'penny.
