Add FPGA Optimized Register File Version #433

ganoam · 2020-07-29T14:03:39Z

Add a register file optimized for synthesis on FPGAs that support
distributed RAM. The register file features two RAM blocks, each with 1
sync-write and 3 async-read ports. To achieve the behavior of a 2
sync-write / 3 async-read register file, read access is arbitrated
depending on which block was last written to. For this purpose, an
additional array of NUM_TOT_WORDS 1-bit registers is introduced.
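
The arbitration described above can be illustrated with a small behavioural model. This is only a sketch in Python, not the RTL from this PR: the class and method names (`DualBankRegFile`, `clock`, `read`, `newest_in_b`) are hypothetical, and two details are assumptions that the description does not state, namely that address 0 always reads as zero and that port B wins when both ports write the same address in the same cycle.

```python
class DualBankRegFile:
    """Behavioural sketch: two 1-write banks plus one select bit per register."""

    def __init__(self, num_words=32, data_width=32):
        self.mask = (1 << data_width) - 1
        self.bank_a = [0] * num_words           # written only by write port A
        self.bank_b = [0] * num_words           # written only by write port B
        self.newest_in_b = [False] * num_words  # 1-bit flag per register

    def clock(self, write_a=None, write_b=None):
        """Model one clock edge. Each write is an (addr, data) tuple or None."""
        if write_a is not None:
            addr, data = write_a
            self.bank_a[addr] = data & self.mask
            self.newest_in_b[addr] = False
        if write_b is not None:
            addr, data = write_b
            self.bank_b[addr] = data & self.mask
            # Applied last, so port B wins a same-address tie (assumption).
            self.newest_in_b[addr] = True

    def read(self, addr):
        """Asynchronous read: return the value from the bank written last."""
        if addr == 0:
            return 0  # assumption: register 0 is hard-wired to zero
        if self.newest_in_b[addr]:
            return self.bank_b[addr]
        return self.bank_a[addr]


rf = DualBankRegFile()
rf.clock(write_a=(5, 0x11), write_b=(6, 0x22))  # both write ports in one cycle
print(hex(rf.read(5)), hex(rf.read(6)))
rf.clock(write_b=(5, 0x33))                     # port B overwrites register 5
print(hex(rf.read(5)))
```

Each write port only ever touches its own bank, so no data multiplexer is needed in front of the storage; the select flag is the only state shared by the two ports. Three read ports correspond to three independent calls to `read` within the same cycle.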

Savings for FPGA synthesis are achieved by:

- Replacing an array of FFs with distributed RAM. Example: 31 32-bit
  registers as FFs occupy 992 FFs, or 446 LUTs on Xilinx Artix-7 FPGAs.
  The equivalent storage capacity using distributed RAM is implemented
  by 36 RAM32M primitives (inferred from generic HDL), or 144
  distributed-RAM-enabled LUTs, and 31 FFs for block selection (16
  LUTs).
- The distributed RAM primitives have the read-address decoders
  already integrated. This saves three 32-bit 32-to-1 multiplexers at
  the read ports.
- Since both write ports unconditionally write to their respective
  RAM blocks, the multiplexing of the write ports is also saved. That
  is 32 32-bit 2-to-1 multiplexers.
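
The storage figures in the first item can be retraced with a few lines of arithmetic. This is a sketch under stated assumptions: the variable names are illustrative, the count of 36 RAM32M primitives is taken from the description as-is, and the factor of four LUTs per RAM32M reflects how that primitive maps onto 7-series slices.

```python
# Storage tally for the example above (31 registers of 32 bits each).
NUM_REGS = 31
DATA_WIDTH = 32

# Flip-flop implementation: one FF per stored bit.
ff_storage_bits = NUM_REGS * DATA_WIDTH

# Distributed-RAM implementation.
RAM32M_PRIMITIVES = 36   # figure quoted in the description
LUTS_PER_RAM32M = 4      # assumption: one RAM32M occupies four LUTs
dist_ram_luts = RAM32M_PRIMITIVES * LUTS_PER_RAM32M

# One 1-bit block-select flag per register.
block_select_ffs = NUM_REGS

print(ff_storage_bits, dist_ram_luts, block_select_ffs)  # 992 144 31
```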

Concrete savings (synthesized for Xilinx Artix-7 FPGA):

- without FPU reg file:
    baseline:    7347 LUTs, 2508 FFs
    optimized:   5722 LUTs, 1541 FFs
    -------------------------------
    difference: -1625 LUTs (-22.1%)
                 -967 FFs  (-38.6%)
- with FPU reg file:
    baseline:   13160 LUTs, 4027 FFs
    optimized:  10257 LUTs, 2062 FFs
    -------------------------------
    difference: -3353 LUTs (-24.6%)
                -1965 FFs  (-48.8%)
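
The differences can be recomputed from the quoted baseline and optimized counts. The helper below is a sketch (the name `savings` is made up for illustration). It reproduces three of the four quoted differences. For the LUT count with the FPU reg file, the quoted baseline and optimized values give -2903 (-22.1%) rather than the quoted -3353 (-24.6%), so one of those quoted figures presumably contains a typo; which one cannot be determined from this thread.

```python
def savings(baseline, optimized):
    """Return (absolute difference, percentage change rounded to 0.1)."""
    diff = optimized - baseline
    return diff, round(100.0 * diff / baseline, 1)


# (baseline, optimized) pairs exactly as quoted in the description.
quoted = {
    "without FPU reg file": {"LUTs": (7347, 5722), "FFs": (2508, 1541)},
    "with FPU reg file": {"LUTs": (13160, 10257), "FFs": (4027, 2062)},
}

for config, resources in quoted.items():
    for name, (base, opt) in resources.items():
        diff, pct = savings(base, opt)
        print(f"{config}: {diff} {name} ({pct}%)")
```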

Signed-off-by: ganoam [email protected]

ganoam · 2020-07-29T14:24:55Z

Hi guys,

I have done some optimizations for improved mapping of hardware to FPGA resources on Ibex as part of my master's thesis. I have now been instructed to port those optimizations to other cores (cv32e40p, cva and snitch). This is the first simple but effective optimization, exploiting the distributed RAM capabilities provided by many FPGAs.

I have tested the new register file by executing the cv32-firmware tests using vcs in core-v-verif/cv32/sim/uvmt_cv322.
The cv32-riscv-tests complete fine. The compliance test I_LB_01 fails with an out-of-bounds read from fffffacb. However, the baseline design fails at the same point, so I assume that's fine for now.

I did not find very extensive tests exercising the register file (particularly for the FPU register file).
If you want me to do more extensive verification, can somebody point me in the right direction?

cheers, Noam

MikeOpenHWGroup · 2020-07-29T15:26:42Z

Hi @ganoam. I can comment on the tests. In the ci directory there is a script called ci_check that runs something called the user regression. Anything that gets a green light from ci_check should be good enough for a merge.

core-v-verif does not have any testing for the FPU at this time. We will be freezing the RTL for CV32E40P with the top-level parameter "FPU" set to 0. Subsequent releases of CV32E40P will include full verification of the FPU and associated APU interface.

Question: are you pulled into the OpenHW Hardware Task Group? They are actively developing a port of the CV32E40P to an FPGA right now.

ganoam · 2020-07-30T12:19:36Z

Hi @MikeOpenHWGroup. Thanks for your answer.

The ci_check runs through fine when using the pulp toolchain. When using my previously installed generic toolchain, hello-world and illegal already failed on the master branch. (If that is unexpected, I will try again with a fresh toolchain build and provide you with details if it's still the case.)

Edit: It does work with the generic riscv toolchain. I had screwed up my environment variables.

To verify, I ran:

./ci_check -s vcs --repo .... -branch ....

The implementation passed the tests hello-world, riscv_ebreak_test_0, riscv_arithmetic_basic_test_0 and illegal.

Judging by the output of

./ci_check -s vcs -d

I suspect that was not all that should have been tested. Am I missing something?

Thanks a lot for your help.

To your question: No, I am not in the OpenHW Hardware Task Group. Should / can I become part of it? Although I am not sure how much work I will be able to put into it, since I have not scheduled much time for these optimizations.

MikeOpenHWGroup · 2020-07-31T12:44:30Z

> I suspect that was not all that should have been tested. Am I missing something?

Yes, there is a lot of background there that you could not be aware of. The VCS Makefiles do not yet support all tests in our regression, most notably the Google riscv-dv tests. But if ci_check says you are OK, then the PR is considered safe to merge. We will subject the merged code to our full regression at least daily.

> I am not in the OpenHW Hardware Task Group. Should / can I become part of it?

Absolutely.


gautschimi · 2021-01-19T16:18:41Z

I had a look at this MR and it looks good. FPGA mapping is significantly improved. I suggest merging it.

ganoam mentioned this pull request Jul 29, 2020

Add reference to FPGA optimized RegFile for cv32 openhwgroup/programs#149

Closed

ganoam force-pushed the fpga-opt-regfile branch from 1c2b8c7 to acdcc76 Compare July 31, 2020 07:38

ganoam force-pushed the fpga-opt-regfile branch 4 times, most recently from ea467b5 to 53e963a Compare August 7, 2020 13:21

ganoam force-pushed the fpga-opt-regfile branch from 53e963a to 0d91965 Compare August 7, 2020 13:23

Silabs-ArjanB added the WAIVED:CV32E40P label (Issue does not impact a major release of CV32E40P and is waived) Nov 10, 2020

