From 3860e6e9037898122a37bcff07ef14e65147020b Mon Sep 17 00:00:00 2001
From: Johannes de Fine Licht
Date: Tue, 17 May 2022 04:51:20 +0200
Subject: [PATCH] Added README

---
 README.md | 115 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 115 insertions(+)
 create mode 100644 README.md

diff --git a/README.md b/README.md
new file mode 100644
index 0000000..35f2d8b
--- /dev/null
+++ b/README.md
@@ -0,0 +1,115 @@
+# Fast Arbitrary Precision Floating Point on FPGA
+
+A detailed description of the approach implemented in this repository can be
+found in our [FCCM'22
+paper](https://spcl.inf.ethz.ch/Publications/.pdf/apfp.pdf) [1].
+
+## Introduction
+
+This repository implements an arbitrary precision floating point multiplier and
+adder using Vitis HLS targeting XRT-enabled Xilinx FPGAs, exposing them through
+a matrix multiplication primitive that allows running them at full throughput
+without becoming memory bound. The design is _fully pipelined_, yielding a MAC
+throughput equivalent to the frequency times the number of compute units
+instantiated.
+
+Instantiations of the design on an Alveo U250 accelerator were shown to yield
+2.0 GMAC/s of 512-bit matrix-matrix multiplication; an order of magnitude
+higher than a 36-core dual-socket Xeon node, corresponding to 375× CPU cores
+worth of throughput [1].
+
+## Configuration
+
+The hardware design is configured using CMake. The target Xilinx XRT-enabled
+platform must be specified with the `APFP_PLATFORM` parameter. The most
+important configuration parameters include:
+- The width used for the floating point representation is fixed at compile-time
+  using the `APFP_BITS` CMake parameter, out of which 63 bits will be used for
+  the exponent, 1 bit will be used for the sign, and the remaining bits will be
+  used for the mantissa. The value is currently expected to be a multiple of 512
+  for the sake of being aligned to the memory interface width.
+- To scale the design beyond a single pipelined multiplier, the
+  `APFP_COMPUTE_UNITS` parameter can be used to replicate the full kernel. Each
+  instantiation will run a fully independent matrix multiplication unit. These
+  can be used to collaborate on a single matrix multiplication operation (see
+  `host/TestMatrixMultiplication.cpp` for an example).
+- The floating point multiplier uses Karatsuba decomposition to reduce the
+  overall resource usage of the design. The decomposition bottoms out at
+  `APFP_MULT_BASE_BITS`, after which it falls back on naive multiplication using
+  DSPs as generated by the HLS tool. Similarly, the `APFP_ADD_BASE_BITS`
+  parameter configures the number of bits to dispatch to the HLS tool's addition
+  implementation, manually pipelining the addition into multiple stages above
+  this threshold.
+- To avoid being memory bound, the matrix multiplication implementation is
+  tiled using the approach described in our [FPGA'20
+  paper](https://spcl.inf.ethz.ch/Publications/.pdf/gemm-fpga.pdf) [2]. The
+  tile sizes are exposed through the `APFP_TILE_SIZE_N` and `APFP_TILE_SIZE_M`
+  parameters. The highest arithmetic intensity is achieved when these two
+  quantities are equal and maximized, but relatively small tile sizes are
+  sufficient to overcome the memory bottleneck (e.g., 32x32). Higher tile sizes
+  increase arithmetic intensity at the cost of BRAM usage, and potential
+  overhead when the input matrix is not a multiple of the tile size.
+- `APFP_FREQUENCY` can be used to change the maximum frequency targeted by the
+  design. If unspecified, the default of the target platform will be used.
+
+For more details on how to configure the project to achieve high throughput,
+see our paper [1].
+
+## Configuration and compilation
+
+Please make sure you clone the repository with `git clone --recursive` or run
+`git submodule update --init` after cloning to check out dependencies.
+
+The minimum commands necessary to configure and build the code are:
+
+```bash
+mkdir build
+cd build
+cmake .. # Default parameters
+make # Builds software components
+make hw # Builds hardware accelerator
+```
+
+However, the accelerator should always be configured to match the target system
+using the parameters described in the previous section and in our paper [1].
+The CMake configuration flow uses
+[hlslib](https://github.com/definelicht/hlslib) [3] to locate the Xilinx tools
+and expose hardware build targets.
+
+The project depends on Vitis, GMP, and MPFR to successfully configure.
+
+## Running the code
+
+We provide an example host code that runs the matrix multiplication accelerator
+on a randomized input in `host/TestMatrixMultiplication.cpp`. See the executable
+for usage. An example invocation could be:
+
+```bash
+./TestMatrixMultiplicationHardware hw 256 256 256
+```
+
+## Installation
+
+To install the project, including both the software interface components and the
+hardware accelerator itself (built with `make hw`), simply run `make install`.
+The location to install the project in is configured with the
+`CMAKE_INSTALL_PREFIX` parameter.
+
+## References
+
+[1] Johannes de Fine Licht, Christopher A. Pattison, Alexandros Nikolaos
+Ziogas, David Simmons-Duffin, and Torsten Hoefler, _"Fast Arbitrary Precision
+Floating Point on FPGA"_, in Proceedings of the 2022 IEEE 30th Annual
+International Symposium on Field-Programmable Custom Computing Machines
+(FCCM'22). [🔗](https://spcl.inf.ethz.ch/Publications/.pdf/apfp.pdf)
+
+[2] Johannes de Fine Licht, Grzegorz Kwasniewski, and Torsten Hoefler,
+_"Flexible Communication Avoiding Matrix Multiplication on FPGA with High-Level
+Synthesis"_, in Proceedings of 28th ACM/SIGDA International Symposium on
+Field-Programmable Gate Arrays (FPGA'20).
+[🔗](https://spcl.inf.ethz.ch/Publications/.pdf/gemm-fpga.pdf)
+
+[3] Johannes de Fine Licht and Torsten Hoefler, _"hlslib: Software Engineering
+for Hardware Design"_, presented at the Fifth International Workshop on
+Heterogeneous High-performance Reconfigurable Computing (H2RC'19).
+[🔗](https://spcl.inf.ethz.ch/Publications/.pdf/hlslib.pdf)