All notable changes to this project will be documented in this file.
The format is based on Keep a Changelog and this project adheres to Semantic Versioning.
- Avoid corner-case in which the sequencer issues the same instruction multiple times when two units become non-ready at the same time
- The lane sequencer now calculates the correct number of elements to be requested by the operand requesters
- When the instruction queue of the SLDU is not empty, read from it to update the commit counter, and not from the incoming request
- Masked operations now target the MASKU as well for the issue checks in the main sequencer
- The granularity of the issue checks/gold tickets is now at a PE level
- Avoid losing hazard-related information in the pipeline between the main sequencer and the operand requesters
- Fix anticipated grant bug from operand requester to LDU, SLDU, MASKU, because of the stream registers. Now, the three units wait for a final true grant before committing
- The mask unit does not require synchronized lanes anymore to commit an instruction
- The sequencer can issue instructions to non-full units even if the other units are full
- Vector indexed unordered/ordered load ()
- Vector indexed unordered/ordered stores ()
- Vector integer reductions ()
- Introduce the global hazard table in the main sequencer, to provide up-to-date information to the operand requesters about the status of the different dependent instructions
- The main sequencer issues instructions every time the target unit has a non-full instruction queue
- The main sequencer stalls if the instructions target a lane, and its operand requesters are not ready
- New instructions enter the main sequencer with a token that marks them as new, and the related counter is updated upon arrival
- Update
with instructions on how to compile convolutions
- Refactor
app
- Double the testbench memory size
- Update the
- Fix typo on the build instructions of the README
- Fix Gnuplot installation on GitHub's CI
- The number of elements requested by the Store Unit and the Element Requester now depends both on the requested eew and the past
of the vector of the used register
- When the VRF is written and EMUL > 1, the eew of all the interested registers is updated
- Memory operations can change EMUL when EEW != VSEW
- The LSU now correctly handles bursts with a saturated length of 256 beats
- AXI transactions on an opposite channel w.r.t. the channel currently in use are started only after the completion of the previous transactions
- Fix the number of elements to be requested for a
instruction
- Generate
files before compiling the programs
- Clean intermediate app object files with make clean
- Add a
before stopping the cycle counter, to let the last vector store complete
- app to benchmark Ara
- CI task to create roofline plots of
, available as artifacts
- Vector floating-point compare instructions ()
- Vector single-width floating-point/integer type-convert instructions ()
- Vector widening floating-point/integer type-convert instructions ()
- Vector narrowing floating-point/integer type-convert instructions ()
- Vector whole-register move instruction
- Vector whole-register load/store
- Vector load/store mask
- Whole-register instructions are also executed if vtype.vl == 0
- Makefile option () to generate waveform traces when running simulations with Verilator
- Add
kernel, optimized for 7x7 filters
- Optimize
kernels for 3x3 filters
- Add convolutions to the
app, and print the related roofline plots
- Add corner case test to
- Add spill register at the lane edge, to cut the timing-critical interface between the Mask unit and the VFUs
- Increase latency of the 16-bit multiplier from 0 to 1 to cut an in-lane timing-critical path
- Widen CVA6's cache lines
- Implement back-to-back accelerator instruction issue mechanism on CVA6
- Use https protocol when cloning DTC from main Makefile
- Use https protocol for newlib-cygwin in .gitmodules
- Cut a timing-critical path from Addrgen to Sequencer (1 cycle more to start an AXI transaction)
- Cut a timing-critical path in the
, relative to the calculation of the pointer to the VRF word received from the lanes
- Create
wrapper containing Ara, Ariane, and an AXI mux, instantiated from within Ara's SoC
- Retime address calculation of the
- Push
operand muxing from the lanes to the Mask Unit
- Reduce CVA6's default cache size
- Update Verilator to
- Update bender to
- Fix calculation of
's vector length
- Fix
operand's vector length trimming
- Mute mask requests on idle lanes
- Mute instructions with vector length zero on the respective
- Fix
's offset calculation
- Delay acknowledgment of memory requests if the
is busy
- Format source files in the
folder with clang-format by running make format
- Support for the
, and 16_lanes configurations, besides the default 4_lanes
- Compile Verilator and Ara's verilated model with LLVM, for a faster compile time.
- Verilator updated to version v4.210.
- Verilation is done with a hierarchical verilation flow
- Replace
's LLC with a simple main memory
- Reduce number of words on the main memory, for faster Verilation
- Update
to v1.22.1
- Update
to v0.29.1
- Script to align all the elf sections to the AXI Data Width (the testbench requires it)
- RISC-V V intrinsics can now be compiled
- Add support for
instructions
- Add support for strided memory operations
- Add support for stores misaligned w.r.t. the AXI Data Width
- Alignment with lowRISC's coding guidelines
- Update Ara support for RISC-V V extension to V 0.10, with the exception of the instructions that were already missing
- Replace toolchain from GCC to LLVM when compiling for RISC-V V extension
- Update toolchain and SPIKE support to RISC-V V 0.10
- Patches for GCC and SPIKE are no longer required
- Ara benchmarks are now compatible with RISC-V V 0.10
- Fix
definition in the Load Unit
- Fix check to discriminate a valid byte in the VRF word, in the Load Unit
- Fix
calculation in the Address Generation Unit
- Correctly check whether the generated address corresponds to the vector load or the store unit
- Typos on the ChangeLog's dates
- Remove unwanted latches in the
, and decoder
- Fix vl == 0 memory operations bug. Ara now correctly tells Ariane that the memory operation is over
- Hardware support for:
  - Vector slide instructions (vslideup, vslide1up, vfslide1up, vslidedown, vslide1down, vfslide1down)
- Software implementation of an integer 2D convolution kernel
- CI job to check the conv2d execution on Ara
- Remove dependency on a specific gcc/g++ version in the Makefile
- Arithmetic and memory vector instructions with vl == 0 are treated as a NOP
- Increase the bit width of the vector length type (), accounting for vectors whose length is VLMAX
- Fix vector length calculation for the
operand, which depends on vsew
- Fix typo on the
updating logic at the Mask Unit
- Update README to highlight dependency on Spike
- Update Bender's link dependency to the public CVA6 repository
- Retrigger the
module if the ModelSim compilation did not succeed
- The
in the common Ara runtime is now a copy of the encoding.h in the Spike submodule
- Parametrization for FPU and FPU-specific formats support, through the
ara_soc parameter
- GitHub Actions-based CI
- Hardware support for:
  - Vector single-width floating-point fused multiply-add instructions (vfnmacc, vfmsac, vfnmsac, vfnmadd, vfmsub, vfnmsub)
  - Vector floating-point sign-injection instructions (vfsgnj, vfsgnjn, vfsgnjx)
  - Vector widening floating-point add/subtract instructions (vfwadd, vfwsub, vfwadd.w, vfwsub.w)
  - Vector widening floating-point multiply instructions (vfwmul)
  - Vector widening floating-point fused multiply-add instructions (vfwmacc, vfwnmacc, vfwmsac, vfwnmsac)
  - Vector floating-point merge instruction (vfmerge)
  - Vector floating-point move instruction (vfmv)
- Contributing guidelines updated to include commit message and C++ code style guidelines
- Hardware support for:
  - Vector single-width floating-point add/subtract instructions (vfadd, vfsub, vfrsub)
  - Vector single-width floating-point multiply instructions (vfmul)
  - Vector single-width floating-point fused multiply-add instructions (vfmacc, vfmadd)
  - Vector single-width floating-point min/max instructions (vfmin, vfmax)
- Software implementation of a floating-point matrix multiplication kernel
- Support for a coherent mode between Ara and Ariane
  - Snoop AW channel from Ara to L2
  - Invalidate Ariane's L1 cache sets accordingly
  - Coherent mode can be toggled together with consistent mode using the LSB of CSR 0x702
- Ariane's data cache is active by default
- The matrix multiplication kernel achieves better performance
  - It reports the performance and the utilization for several matrix sizes
- Hardware support for:
  - Vector single-width integer divide instructions (vdivu, vdiv, vremu, vrem)
  - Vector integer comparison instructions (vmseq, vmsne, vmsltu, vmslt, vmsleu, vmsle, vmsgtu, vmsgt)
  - Vector carry-out of add-with-carry and subtract-with-borrow instructions (vmadc, vmsbc)
- Runtime measurement functions
- Consistent mode which orders scalar and vector loads/stores.
  - Conservative ordering without address comparison
  - Consistent mode is enabled by default and can be disabled by clearing the LSB of CSR 0x702.
- Ariane's accelerator dispatcher module was rewritten, fixing a bug where instructions would get skipped.
- The Vector Store unit takes the EEW of the source vector register into account to shuffle the elements before writing them to memory.
- Vector mask instructions (vmand, vmnand, vmandnot, vmxor, vmor, vmnor, vmornot, vmxnor) no longer require the non-compliant constraint that the vector length is divisible by eight.
- Hardware compilation with Verilator
- Software implementation of a matrix multiplication kernel
- The
Makefile target was deprecated. The riscv-tests are now run with the Verilated design, which can be called through the riscv_tests_simv Makefile target.
- The operand queues now take as a parameter the type conversions they support (currently,
, and SupportIntExt8)
- The Vector Multiplier unit now has independent pipelines for each element width.
- Hardware support for:
  - Vector single-width integer multiply instructions (vmul, vmulh, vmulhu, vmulhsu)
  - Vector single-width integer multiply-add instructions (vmacc, vnmsac, vmadd, vnmsub)
  - Vector integer add-with-carry/subtract-with-borrow instructions (vadc, vsbc)
  - Vector widening integer multiply instructions (vwmul, vwmulu, vwmulsu)
  - Vector widening integer multiply-add instructions (vwmaccu, vwmacc, vwmaccsu, vwmaccus)
- Explicit scan chain signals added to the lane's and Ara's interfaces
- Miscellaneous fixes for compatibility with Synopsys DC
- Send the correct bits of the address to the Vector Register File's banks
- Correctly calculate the initial address of each vector register in the VRF
- Hardware support for:
  - Bit-shift instructions (vsll, vsrl, vsra)
  - Vector widening integer add/subtract (vwadd, vwaddu, vwsub, vwsubu)
  - Vector integer extension (vzext, vsext)
  - Vector integer merge and move instructions (vmerge, vmv)
  - Vector narrowing integer right shift instructions (vnsrl, vnsra)
- Bender updated to version 0.21.0
- CVA6's forwarding mechanism of operand B for accelerator instructions
- Hardware support for:
  - Vector configuration instructions (vsetvl/vsetvli)
  - Unit-strided vector loads and vector stores
  - Basic arithmetic and logic instructions (vand, vor, vxor, vadd, vsub, vrsub, vmin-u, vmax-u)
  - Predicated instructions through a mask unit
  - Vector mask instructions (vmand, vmnand, vmandnot, vmor, vmnor, vmornot, vmxor, vmxnor)
  - Length multipliers
- Implementation of a synthesizable Ara SoC top-level
- Software support for RISC-V Vector code
- Continuous integration tests through riscv-tests executed both with Spike and on Ara
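
Several entries above state that the consistent and coherent modes are controlled through the LSB of CSR 0x702. As an illustration only, a minimal C sketch of clearing or setting that bit could look like the following. The helper names `ara_mode_update` and `ara_set_consistent_mode` are hypothetical and not part of Ara's runtime, and the sketch assumes that the CSR is readable and writable from the privilege level the code runs at.

```c
#include <stdbool.h>

// Bit 0 (LSB) of CSR 0x702 is the mode bit named in the changelog entries.
#define ARA_MODE_BIT 0x1UL

// Hypothetical helper: returns csr_value with the mode bit set or cleared,
// leaving every other bit untouched.
static inline unsigned long ara_mode_update(unsigned long csr_value, bool enable) {
  return enable ? (csr_value | ARA_MODE_BIT) : (csr_value & ~ARA_MODE_BIT);
}

#if defined(__riscv)
// Hypothetical wrapper: read-modify-write of CSR 0x702 on a RISC-V target.
// Assumption: the CSR can be accessed with csrr/csrw at the current privilege level.
static inline void ara_set_consistent_mode(bool enable) {
  unsigned long value;
  __asm__ volatile("csrr %0, 0x702" : "=r"(value));
  value = ara_mode_update(value, enable);
  __asm__ volatile("csrw 0x702, %0" : : "r"(value));
}
#endif
```

The bit manipulation is kept in a separate pure function so that it can be checked on a host machine, while the CSR access itself only compiles for RISC-V targets.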