# Fast Arbitrary Precision Floating Point on FPGA

A detailed description of the approach implemented in this repository can be
found in our [FCCM'22
paper](https://spcl.inf.ethz.ch/Publications/.pdf/apfp.pdf) [1].

## Introduction

This repository implements an arbitrary precision floating point multiplier
and adder in Vitis HLS, targeting XRT-enabled Xilinx FPGAs. The operators are
exposed through a matrix multiplication primitive that allows running them at
full throughput without becoming memory bound. The design is _fully pipelined_,
yielding a MAC throughput equal to the clock frequency times the number of
compute units instantiated.
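As an illustrative example, a hypothetical instance with two compute units
clocked at 300 MHz would sustain 0.6 GMAC/s.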

Instantiations of the design on an Alveo U250 accelerator were shown to yield
2.0 GMAC/s of 512-bit matrix-matrix multiplication, an order of magnitude
higher than a 36-core dual-socket Xeon node and corresponding to the throughput
of 375 CPU cores [1].

## Configuration

The hardware design is configured using CMake. The target Xilinx XRT-enabled
platform must be specified with the `APFP_PLATFORM` parameter. The most
important configuration parameters include:
- The width of the floating point representation is fixed at compile time
using the `APFP_BITS` CMake parameter. Of these bits, 63 are used for the
exponent, 1 for the sign, and the remainder for the mantissa. The value is
currently expected to be a multiple of 512 to stay aligned with the memory
interface width.
- To scale the design beyond a single pipelined multiplier, the
`APFP_COMPUTE_UNITS` parameter can be used to replicate the full kernel. Each
instantiation runs a fully independent matrix multiplication unit, and multiple
units can collaborate on a single matrix multiplication operation (see
`host/TestMatrixMultiplication.cpp` for an example).
- The floating point multiplier uses Karatsuba decomposition to reduce the
overall resource usage of the design. The decomposition bottoms out at
`APFP_MULT_BASE_BITS`, at which point it falls back on naive multiplication
using DSPs as generated by the HLS tool. Similarly, the `APFP_ADD_BASE_BITS`
parameter configures the number of bits dispatched to the HLS tool's addition
implementation; above this threshold, the addition is manually pipelined into
multiple stages.
- To avoid being memory bound, the matrix multiplication implementation is
tiled using the approach described in our [FPGA'20
paper](https://spcl.inf.ethz.ch/Publications/.pdf/gemm-fpga.pdf) [2]. The
tile sizes are exposed through the `APFP_TILE_SIZE_N` and `APFP_TILE_SIZE_M`
parameters. The highest arithmetic intensity is achieved when these two
quantities are equal and maximized, but relatively small tile sizes (e.g.,
32x32) are sufficient to overcome the memory bottleneck. Larger tile sizes
increase arithmetic intensity at the cost of higher BRAM usage and potential
overhead when the input matrix dimensions are not a multiple of the tile size.
- `APFP_FREQUENCY` can be used to change the maximum frequency targeted by the
design. If unspecified, the default of the target platform will be used.

For more details on how to configure the project to achieve high throughput,
see our paper [1].
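
As a concrete starting point, the configuration below targets a hypothetical
XRT platform; the placeholder platform name and all numeric values are purely
illustrative and should be adapted to the target system. The remaining
parameters (e.g., `APFP_MULT_BASE_BITS`, `APFP_ADD_BASE_BITS`, and
`APFP_FREQUENCY`) can be added in the same way.

```bash
# Illustrative configuration only: substitute the XRT platform installed on
# your system and tune the values for your target device.
cmake .. \
  -DAPFP_PLATFORM=<your_xrt_platform_name> \
  -DAPFP_BITS=512 \
  -DAPFP_COMPUTE_UNITS=1 \
  -DAPFP_TILE_SIZE_N=32 \
  -DAPFP_TILE_SIZE_M=32
```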

## Configuration and compilation

Please make sure you clone the repository with `git clone --recursive` or run
`git submodule update --init` after cloning to check out dependencies.

The minimum commands necessary to configure and build the code are:

```bash
mkdir build
cd build
cmake .. # Default parameters
make # Builds software components
make hw # Builds hardware accelerator
```

However, the accelerator should always be configured to match the target system
using the parameters described in the previous section and in our paper [1].
The CMake configuration flow uses
[hlslib](https://github.com/definelicht/hlslib) [3] to locate the Xilinx tools
and expose hardware build targets.

The project requires Vitis, GMP, and MPFR to be available in order to
configure successfully.
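
On Debian-based systems, the GMP and MPFR development packages can typically be
installed as shown below; the package names are for Debian/Ubuntu and may
differ on other distributions. Vitis must be installed separately from
AMD/Xilinx.

```bash
# Debian/Ubuntu package names; adjust for other distributions.
sudo apt-get install libgmp-dev libmpfr-dev
```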

## Running the code

We provide an example host code in `host/TestMatrixMultiplication.cpp` that
runs the matrix multiplication accelerator on randomized input. See the
executable for usage instructions. An example invocation could be:

```bash
./TestMatrixMultiplicationHardware hw 256 256 256
```

## Installation

To install the project, including both the software interface components and the
hardware accelerator itself (built with `make hw`), simply run `make install`.
The installation location is configured with the `CMAKE_INSTALL_PREFIX`
parameter.
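
For example, an installation into a custom prefix could look like the
following; the prefix path is illustrative.

```bash
# The prefix below is illustrative; point it wherever the project should live.
cmake .. -DCMAKE_INSTALL_PREFIX=/opt/apfp
make          # Build software components
make hw       # Build the hardware accelerator
make install  # Install both into the configured prefix
```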

## References

[1] Johannes de Fine Licht, Christopher A. Pattison, Alexandros Nikolaos
Ziogas, David Simmons-Duffin, Torsten Hoefler, _"Fast Arbitrary Precision
Floating Point on FPGA"_, in Proceedings of the 2022 IEEE 30th Annual
International Symposium on Field-Programmable Custom Computing Machines
(FCCM'22). [🔗](https://spcl.inf.ethz.ch/Publications/.pdf/apfp.pdf)

[2] Johannes de Fine Licht, Grzegorz Kwasniewski, and Torsten Hoefler,
_"Flexible Communication Avoiding Matrix Multiplication on FPGA with High-Level
Synthesis"_, in Proceedings of 28th ACM/SIGDA International Symposium on
Field-Programmable Gate Arrays (FPGA'20).
[🔗](https://spcl.inf.ethz.ch/Publications/.pdf/gemm-fpga.pdf)

[3] Johannes de Fine Licht and Torsten Hoefler, _"hlslib: Software Engineering
for Hardware Design"_, presented at the Fifth International Workshop on
Heterogeneous High-Performance Reconfigurable Computing (H2RC'19).
[🔗](https://spcl.inf.ethz.ch/Publications/.pdf/hlslib.pdf)
