Commit 3860e6e (1 parent: 8c5d03f). Showing 1 changed file with 115 additions and 0 deletions.

# Fast Arbitrary Precision Floating Point on FPGA

A detailed description of the approach implemented in this repository can be
found in our [FCCM'22
paper](https://spcl.inf.ethz.ch/Publications/.pdf/apfp.pdf) [1].

## Introduction

This repository implements an arbitrary precision floating point multiplier and
adder using Vitis HLS targeting XRT-enabled Xilinx FPGAs, exposing them through
a matrix multiplication primitive that allows running them at full throughput
without becoming memory bound. The design is _fully pipelined_, yielding a MAC
throughput equal to the clock frequency times the number of compute units
instantiated.

Instantiations of the design on an Alveo U250 accelerator were shown to yield
2.0 GMAC/s of 512-bit matrix-matrix multiplication, an order of magnitude
higher than a 36-core dual-socket Xeon node and corresponding to 375 CPU cores'
worth of throughput [1].

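As a rough illustration of the throughput statement above, the peak MAC rate of a fully pipelined design is the clock frequency multiplied by the number of compute units. The two values below are hypothetical placeholders, not the configuration measured in [1].

```bash
# Peak throughput of a fully pipelined design: one MAC per cycle per compute unit.
# Both values are hypothetical placeholders, not measurements from the paper.
FREQUENCY_MHZ=300    # assumed clock frequency in MHz
COMPUTE_UNITS=4      # assumed number of replicated kernels

# MHz times compute units gives millions of MACs per second.
PEAK_MMACS=$(( FREQUENCY_MHZ * COMPUTE_UNITS ))
echo "Peak throughput: ${PEAK_MMACS} MMAC/s"
```

The achieved rate can be lower than this bound, for example when the matrix dimensions are not multiples of the tile size.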
## Configuration

The hardware design is configured using CMake. The target Xilinx XRT-enabled
platform must be specified with the `APFP_PLATFORM` parameter. The most
important configuration parameters include:
- The width used for the floating point representation is fixed at compile time
  using the `APFP_BITS` CMake parameter, out of which 63 bits will be used for
  the exponent, 1 bit will be used for the sign, and the remaining bits will be
  used for the mantissa. The value is currently expected to be a multiple of 512
  for the sake of being aligned to the memory interface width.
- To scale the design beyond a single pipelined multiplier, the
  `APFP_COMPUTE_UNITS` parameter can be used to replicate the full kernel. Each
  instantiation will run a fully independent matrix multiplication unit. These
  can be used to collaborate on a single matrix multiplication operation (see
  `host/TestMatrixMultiplication.cpp` for an example).
- The floating point multiplier uses Karatsuba decomposition to reduce the
  overall resource usage of the design. The decomposition bottoms out at
  `APFP_MULT_BASE_BITS`, after which it falls back on naive multiplication using
  DSPs as generated by the HLS tool. Similarly, the `APFP_ADD_BASE_BITS`
  parameter configures the number of bits to dispatch to the HLS tool's addition
  implementation; above this threshold, the addition is manually pipelined into
  multiple stages.
- To avoid being memory bound, the matrix multiplication implementation is
  tiled using the approach described in our [FPGA'20
  paper](https://spcl.inf.ethz.ch/Publications/.pdf/gemm-fpga.pdf) [2]. The
  tile sizes are exposed through the `APFP_TILE_SIZE_N` and `APFP_TILE_SIZE_M`
  parameters. The highest arithmetic intensity is achieved when these two
  quantities are equal and maximized, but relatively small tile sizes are
  sufficient to overcome the memory bottleneck (e.g., 32x32). Larger tile sizes
  increase arithmetic intensity at the cost of BRAM usage, and of potential
  overhead when the matrix dimensions are not multiples of the tile size.
- `APFP_FREQUENCY` can be used to change the maximum frequency targeted by the
  design. If unspecified, the default of the target platform will be used.

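To make the Karatsuba bullet above more concrete, here is a toy sketch in shell arithmetic. It is not the repository's HLS implementation: operands are limited to about 30 bits so that products fit in the shell's 64-bit integers, and `BASE_BITS` and `karatsuba` are hypothetical names, with `BASE_BITS` standing in for the role of `APFP_MULT_BASE_BITS`.

```bash
# Toy Karatsuba multiplication on small non-negative integers (illustration only).
# BASE_BITS plays the role of APFP_MULT_BASE_BITS: at or below this width the
# recursion stops and a plain multiplication is used.
BASE_BITS=8

karatsuba() {
  local a b bits half mask a0 a1 b0 b1 z0 z1 z2
  a=$1; b=$2; bits=$3
  if [ "$bits" -le "$BASE_BITS" ]; then
    echo $(( a * b ))                 # base case: naive multiplication
    return
  fi
  half=$(( bits / 2 ))
  mask=$(( (1 << half) - 1 ))
  # Split each operand into a low part (half bits) and a high part.
  a0=$(( a & mask )); a1=$(( a >> half ))
  b0=$(( b & mask )); b1=$(( b >> half ))
  # Three recursive products instead of the four a schoolbook split needs.
  z0=$(karatsuba "$a0" "$b0" "$half")
  z2=$(karatsuba "$a1" "$b1" $(( bits - half )))
  z1=$(karatsuba $(( a0 + a1 )) $(( b0 + b1 )) $(( bits - half + 1 )))
  # Recombine: a*b = z2*2^(2*half) + (z1 - z2 - z0)*2^half + z0
  echo $(( (z2 << (2 * half)) + ((z1 - z2 - z0) << half) + z0 ))
}

karatsuba 123456789 987654321 30
```

Each recursion level replaces four half-width multiplications with three, at the cost of a few extra additions and subtractions, which is where the resource saving comes from. Raising the base width trades fewer recursion levels for wider base multipliers.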
For more details on how to configure the project to achieve high throughput,
see our paper [1].
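The effect of the tile sizes can be estimated with a back-of-the-envelope calculation. The sketch below assumes a simplified model, not taken from the repository, in which each step along the shared dimension loads `N + M` operands for an `N x M` output tile and performs `N * M` MACs. The variable names mirror the CMake parameters, but the script itself is purely illustrative.

```bash
# Simplified model (an assumption, see the text above): per step along the
# shared dimension, a tile loads N + M operands and performs N * M MACs.
APFP_BITS=512
APFP_TILE_SIZE_N=32
APFP_TILE_SIZE_M=32

MACS_PER_STEP=$(( APFP_TILE_SIZE_N * APFP_TILE_SIZE_M ))
LOADS_PER_STEP=$(( APFP_TILE_SIZE_N + APFP_TILE_SIZE_M ))
# Integer division is exact for this shape; other shapes may be rounded down.
INTENSITY=$(( MACS_PER_STEP / LOADS_PER_STEP ))
# On-chip storage for one output tile of APFP_BITS-wide numbers, in KiB.
TILE_KIB=$(( MACS_PER_STEP * APFP_BITS / 8 / 1024 ))

echo "MACs per operand loaded: ${INTENSITY}"
echo "Output tile storage: ${TILE_KIB} KiB"
```

Under this model, a 64x16 tile holds the same 1024 entries but performs only 1024 / 80 = 12.8 MACs per operand loaded, compared to 16 for the 32x32 tile, which is consistent with the advice to keep the two tile sizes equal.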
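Putting the parameters together, a configuration command could look like the sketch below. All values are illustrative assumptions, and the platform string is only an example that must be replaced with the platform installed on the target system. The script checks the multiple-of-512 constraint, reports the resulting mantissa width, and prints the `cmake` command for review instead of running it.

```bash
# All values below are illustrative; adjust them to the target system.
APFP_PLATFORM="${APFP_PLATFORM:-xilinx_u250_gen3x16_xdma_3_1_202020_1}"  # example name only
APFP_BITS=1024
APFP_COMPUTE_UNITS=2
APFP_TILE_SIZE_N=32
APFP_TILE_SIZE_M=32

# The width is expected to be a multiple of 512 (memory interface alignment).
if [ $(( APFP_BITS % 512 )) -ne 0 ]; then
  echo "APFP_BITS must be a multiple of 512" >&2
  exit 1
fi

# 63 exponent bits and 1 sign bit leave the remaining bits for the mantissa.
MANTISSA_BITS=$(( APFP_BITS - 63 - 1 ))
echo "Mantissa bits: ${MANTISSA_BITS}"

# Assemble the command as text so it can be reviewed before running it
# from inside the build directory.
CONFIGURE_CMD="cmake .. -DAPFP_PLATFORM=${APFP_PLATFORM} -DAPFP_BITS=${APFP_BITS}"
CONFIGURE_CMD="${CONFIGURE_CMD} -DAPFP_COMPUTE_UNITS=${APFP_COMPUTE_UNITS}"
CONFIGURE_CMD="${CONFIGURE_CMD} -DAPFP_TILE_SIZE_N=${APFP_TILE_SIZE_N}"
CONFIGURE_CMD="${CONFIGURE_CMD} -DAPFP_TILE_SIZE_M=${APFP_TILE_SIZE_M}"
echo "${CONFIGURE_CMD}"
```

Parameters that are left out, such as `APFP_FREQUENCY`, keep their defaults.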
## Configuration and compilation

Please make sure you clone the repository with `git clone --recursive` or run
`git submodule update --init` after cloning to check out dependencies.

The minimum commands necessary to configure and build the code are:

```bash
mkdir build
cd build
cmake ..  # Default parameters
make      # Builds software components
make hw   # Builds hardware accelerator
```

However, the accelerator should always be configured to match the target system
using the parameters described in the previous section and in our paper [1].
The CMake configuration flow uses
[hlslib](https://github.com/definelicht/hlslib) [3] to locate the Xilinx tools
and expose hardware build targets.

The project depends on Vitis, GMP, and MPFR to successfully configure.

## Running the code

We provide an example host code that runs the matrix multiplication accelerator
on a randomized input in `host/TestMatrixMultiplication.cpp`. See the executable
for usage. An example invocation could be:

```bash
./TestMatrixMultiplicationHardware hw 256 256 256
```

## Installation

To install the project, including both the software interface components and the
hardware accelerator itself (built with `make hw`), simply run `make install`.
The location to install the project in is configured with the
`CMAKE_INSTALL_PREFIX` parameter.

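The installation steps above can be sketched as the following sequence. The install location is a hypothetical example, and the `run` helper is an illustrative wrapper (not part of the project) that prints each command by default so the sequence can be reviewed before anything is executed.

```bash
# Hypothetical install location; any writable directory works.
INSTALL_PREFIX="/opt/apfp"

# Set DRY_RUN=0 to execute the commands instead of printing them.
DRY_RUN="${DRY_RUN:-1}"
run() {
  if [ "$DRY_RUN" -eq 1 ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Intended to be run from inside the build directory.
run cmake .. "-DCMAKE_INSTALL_PREFIX=${INSTALL_PREFIX}"
run make            # builds software components
run make hw         # builds hardware accelerator
run make install    # installs both under INSTALL_PREFIX
```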
## References

[1] Johannes de Fine Licht, Christopher A. Pattison, Alexandros Nikolaos
Ziogas, David Simmons-Duffin, and Torsten Hoefler, _"Fast Arbitrary Precision
Floating Point on FPGA"_, in Proceedings of the 2022 IEEE 30th Annual
International Symposium on Field-Programmable Custom Computing Machines
(FCCM'22). [🔗](https://spcl.inf.ethz.ch/Publications/.pdf/apfp.pdf)

[2] Johannes de Fine Licht, Grzegorz Kwasniewski, and Torsten Hoefler,
_"Flexible Communication Avoiding Matrix Multiplication on FPGA with High-Level
Synthesis"_, in Proceedings of the 28th ACM/SIGDA International Symposium on
Field-Programmable Gate Arrays (FPGA'20).
[🔗](https://spcl.inf.ethz.ch/Publications/.pdf/gemm-fpga.pdf)

[3] Johannes de Fine Licht and Torsten Hoefler, _"hlslib: Software Engineering
for Hardware Design"_, presented at the Fifth International Workshop on
Heterogeneous High-performance Reconfigurable Computing (H2RC'19).
[🔗](https://spcl.inf.ethz.ch/Publications/.pdf/hlslib.pdf)