Add FPGA Optimized Register File Version #433

Open
wants to merge 1 commit into
base: master

@ganoam ganoam commented Jul 29, 2020

Add a register file, optimized for synthesis on FPGAs supporting
distributed RAM. The register file features two RAM blocks, each with 1
sync-write and 3 async-read ports. To achieve the behavior of a 2
sync-write / 3 async-read register file, the read access is arbitrated
depending on which block was last written to. For this purpose an
additional array of 1-bit registers is introduced.
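
The arbitration described above can be sketched as a small behavioral model. This is only an illustration, not the RTL from this PR: the class and method names are hypothetical, and two behaviors are assumptions that the description does not state, namely that write port B wins when both ports write the same address in one cycle, and that register 0 always reads as zero.

```python
class DualBlockRegFile:
    """Behavioral sketch: two 1-write RAM blocks plus one select bit per word."""

    def __init__(self, num_words=32, width=32):
        self.mask = (1 << width) - 1
        self.block_a = [0] * num_words     # written only by write port A
        self.block_b = [0] * num_words     # written only by write port B
        self.last_b = [False] * num_words  # True if block B holds the newest value

    def clock(self, we_a=False, waddr_a=0, wdata_a=0,
              we_b=False, waddr_b=0, wdata_b=0):
        """Model one rising clock edge with both synchronous write ports."""
        if we_a:
            self.block_a[waddr_a] = wdata_a & self.mask
            self.last_b[waddr_a] = False
        if we_b:
            # Assumption: port B is applied last, so it wins a same-address collision.
            self.block_b[waddr_b] = wdata_b & self.mask
            self.last_b[waddr_b] = True

    def read(self, raddr):
        """Asynchronous read: return the word from the block written last."""
        if raddr == 0:
            return 0  # Assumption: register 0 is hard-wired to zero.
        if self.last_b[raddr]:
            return self.block_b[raddr]
        return self.block_a[raddr]


rf = DualBlockRegFile()
rf.clock(we_a=True, waddr_a=5, wdata_a=0x11, we_b=True, waddr_b=6, wdata_b=0x22)
rf.clock(we_b=True, waddr_b=5, wdata_b=0x33)  # block B now holds the newest x5
print([hex(rf.read(a)) for a in (5, 6, 0)])   # ['0x33', '0x22', '0x0']
```

Each of the three read ports would call `read` with its own address; all of them consult the same select bits, which is why a single array of 1-bit registers is enough.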

Savings for FPGA synthesis are achieved by:

  • Replacing an array of FFs with distributed RAM. Example: 31 32-bit
    registers as FFs occupy 992 FFs, or 446 LUTs on Xilinx Artix-7 FPGAs.
    The equivalent storage capacity using distributed RAM is implemented
    by 36 RAM32M primitives (inferred from generic HDL), or 144
    distributed-RAM-enabled LUTs, and 31 FFs for block selection (16
    LUTs).
  • The distributed RAM primitives have the read-address
    decoders already integrated. This saves three 32-bit 32-to-1
    multiplexers at the read ports.
  • Since both write ports unconditionally write to their respective
    RAM blocks, the multiplexing of the write ports is also saved. That
    is 32 32-bit 2-to-1 multiplexers.
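
The counts in the list above can be reproduced with a few lines of arithmetic. This is a sketch for checking the figures only: the RAM32M count is taken from the text as given rather than derived, and the 4-LUTs-per-RAM32M ratio is the one implied by "36 RAM32M ... or 144 LUTs". The variable names are made up for this example.

```python
NUM_REGS = 31          # 32-bit registers in the example above
WIDTH = 32             # bits per register
NUM_RAM32M = 36        # primitive count quoted above (taken as given, not derived)
LUTS_PER_RAM32M = 4    # ratio implied by the text: 36 RAM32M <-> 144 LUTs

ff_storage = NUM_REGS * WIDTH                 # flip-flops in the FF-based version
dist_ram_luts = NUM_RAM32M * LUTS_PER_RAM32M  # LUTs used as distributed RAM
select_ffs = NUM_REGS                         # one block-select bit per register
read_mux_bits = 3 * WIDTH                     # 1-bit 32-to-1 muxes saved at the read ports
write_mux_bits = 32 * WIDTH                   # 1-bit 2-to-1 muxes saved at the write ports

print(ff_storage, dist_ram_luts, select_ffs, read_mux_bits, write_mux_bits)
# 992 144 31 96 1024
```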

Concrete Savings:

  • without FPU reg file:
    baseline: 7347 LUTs, 2508 FFs
    optimized: 5722 LUTs, 1541 FFs
    -------------------------------
    difference: -1625 LUTs (-22.1%)
    -967 FFs (-38.6%)

  • with FPU reg file:
    baseline: 13160 LUTs, 4027 FFs
    optimized: 10257 LUTs, 2062 FFs
    -------------------------------
    difference: -3353 LUTs (-24.6%)
    -1965 FFs (-48.8%)

Signed-off-by: ganoam [email protected]
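
The differences and percentages in the "Concrete Savings" list can be recomputed from the baseline and optimized counts. The helper below is a hypothetical sketch for that check, not part of the PR. Three of the four rows reproduce exactly; the with-FPU LUT row recomputes to -2903 (-22.1%) rather than the listed -3353 (-24.6%), so one of the figures in that row may contain a typo. Which one cannot be determined from this thread.

```python
def savings(baseline, optimized):
    """Return (absolute difference, percent change relative to baseline)."""
    diff = optimized - baseline
    return diff, round(100.0 * diff / baseline, 1)


# (baseline, optimized) pairs copied from the commit message
reported = {
    "no FPU, LUTs": (7347, 5722),
    "no FPU, FFs": (2508, 1541),
    "with FPU, LUTs": (13160, 10257),
    "with FPU, FFs": (4027, 2062),
}

for name, (base, opt) in reported.items():
    diff, pct = savings(base, opt)
    print(f"{name}: {diff} ({pct}%)")
```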

@ganoam
Author

ganoam commented Jul 29, 2020

Hi guys,

I have done some optimizations to improve the mapping of Ibex onto FPGA resources as part of my master's thesis. I have now been instructed to port those optimizations to other cores (cv32e40p, cva and snitch). This is the first one, simple but effective: it exploits the distributed RAM capabilities provided by many FPGAs.

I have tested the new register file by executing the cv32-firmware tests using vcs in core-v-verif/cv32/sim/uvmt_cv322.
The cv32-riscv-tests complete fine. The compliance test I_LB_01 fails with an out-of-bounds read from fffffacb. However, the baseline design fails at the same point, so I assume that's fine for now.

I did not find very extensive tests exercising the register file (particularly for the FPU register file).
If you want me to do more extensive verification, can somebody point me in the right direction?

cheers, Noam

@MikeOpenHWGroup
Member

Hi @ganoam. I can comment on the tests. In the ci directory there is a script called ci_check that runs something called the user regression. Anything that gets a green light from ci_check should be good enough for a merge.

core-v-verif does not have any testing for the FPU at this time. We will be freezing the RTL for CV32E40P with the top-level parameter "FPU" set to 0. Subsequent releases of CV32E40P will include full verification of the FPU and associated APU interface.

Question: are you pulled into the OpenHW Hardware Task Group? They are actively developing a port of the CV32E40P to an FPGA right now.

@ganoam
Author

ganoam commented Jul 30, 2020

Hi @MikeOpenHWGroup. Thanks for your answer.

The ci_check runs through fine when using the pulp toolchain. When using my previously installed generic toolchain, hello-world and illegal already failed on the master branch. (If that is unexpected, I will try again with a fresh toolchain build and provide you with details if it's still the case.)

Edit: It does work with the generic riscv toolchain. I have screwed up my environment variables.

To verify, I ran

./ci_check -s vcs --repo .... -branch ....

The implementation passed the tests hello-world, riscv_ebreak_test_0, riscv_arithmetic_basic_test_0 and illegal.

Based on the output of

./ci_check -s vcs -d

I suspect that was not all that should have been tested. Am I missing something?

Thanks a lot for your help.

To your question: No, I am not in the OpenHW Hardware Task Group. Should / can I become part of it? Although I am not sure how much work I will be able to put into it, since I had planned not to spend too much time on those optimizations.

@MikeOpenHWGroup
Member

I suspect that was not all that should have been tested. Am I missing something?

Yes, there is a lot of background there that you could not be aware of. The VCS Makefiles do not yet support all tests in our regression, most notably the Google riscv-dv tests. But if ci_check says you are OK, then the PR is considered safe to merge. We will subject the merged code to our full regression at least daily.

I am not in the OpenHW Hardware Task Group. Should / can I become part of it?

Absolutely.

@ganoam ganoam force-pushed the fpga-opt-regfile branch 4 times, most recently from ea467b5 to 53e963a Compare August 7, 2020 13:21
Add a register file, optimized for synthesis on FPGAs supporting
distributed RAM. The register file features two RAM blocks, each with 1
sync-write and 3 async-read ports. To achieve the behavior of a 2
sync-write / 3 async-read register file, the read access is arbitrated
depending on which block was last written to. For this purpose an
additional array of *NUM_TOT_WORDS* 1-bit registers is introduced.

Savings for FPGA synthesis are achieved by:
- Replacing an array of FFs with distributed RAM. Example: 31 32-bit
  registers as FFs occupy 992 FFs, or 446 LUTs on Xilinx Artix-7 FPGAs.
  The equivalent storage capacity using distributed RAM is implemented
  by 36 RAM32M primitives (inferred from generic HDL), or 144
  distributed-RAM-enabled LUTs, and 31 FFs for block selection (16
  LUTs).
- The distributed RAM primitives have the read-address
  decoders already integrated. This saves three 32-bit 32-to-1
  multiplexers at the read ports.
- Since both write ports unconditionally write to their respective
  RAM blocks, the multiplexing of the write ports is also saved. That
  is 32 32-bit 2-to-1 multiplexers.

Concrete Savings: (synthesized for Xilinx Artix-7 FPGA)
- without FPU reg file:
        baseline:   7347 LUTs, 2508 FFs
        optimized:  5722 LUTs, 1541 FFs
        -------------------------------
        difference: -1625 LUTs (-22.1%)
                    -967 FFs   (-38.6%)

- with FPU reg file:
        baseline:   13160 LUTs, 4027 FFs
        optimized:  10257 LUTs, 2062 FFs
        -------------------------------
        difference: -3353 LUTs (-24.6%)
                    -1965 FFs  (-48.8%)

Signed-off-by: ganoam <[email protected]>
@Silabs-ArjanB Silabs-ArjanB added the WAIVED:CV32E40P Issue does not impact a major release of CV32E40P and is waived label Nov 10, 2020
@gautschimi
Contributor

I had a look at this MR and it looks good. FPGA mapping is significantly improved. I suggest merging it.
