From 3860e6e9037898122a37bcff07ef14e65147020b Mon Sep 17 00:00:00 2001
From: Johannes de Fine Licht <definelicht@inf.ethz.ch>
Date: Tue, 17 May 2022 04:51:20 +0200
Subject: [PATCH] Added README

---
 README.md | 115 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 115 insertions(+)
 create mode 100644 README.md

diff --git a/README.md b/README.md
new file mode 100644
index 0000000..35f2d8b
--- /dev/null
+++ b/README.md
@@ -0,0 +1,115 @@
+# Fast Arbitrary Precision Floating Point on FPGA
+
+A detailed description of the approach implemented in this repository can be
+found in our [FCCM'22
+paper](https://spcl.inf.ethz.ch/Publications/.pdf/apfp.pdf) [1].
+
+## Introduction
+
+This repository implements an arbitrary precision floating point multiplier and
+adder using Vitis HLS targeting XRT-enabled Xilinx FPGAs, exposing them through
+a matrix multiplication primitive that allows running them at full throughput
+without becoming memory bound. The design is _fully pipelined_, yielding a MAC
+throughput equivalent to the frequency times the number of compute units
+instantiated.
+
+Instantiations of the design on an Alveo U250 accelerator were shown to yield
+2.0 GMAC/s of 512-bit matrix-matrix multiplication, an order of magnitude
+higher than a 36-core dual-socket Xeon node and equivalent to the throughput
+of 375 CPU cores [1].
+
+## Configuration
+
+The hardware design is configured using CMake. The target Xilinx XRT-enabled
+platform must be specified with the `APFP_PLATFORM` parameter. The most
+important configuration parameters include:
+- The width used for the floating point representation is fixed at compile time
+  using the `APFP_BITS` CMake parameter: 63 bits are used for the exponent, 1
+  bit for the sign, and the remaining `APFP_BITS - 64` bits for the mantissa
+  (e.g., 448 bits when `APFP_BITS` is 512). The value is currently expected to
+  be a multiple of 512 so that it is aligned to the memory interface width.
+- To scale the design beyond a single pipelined multiplier, the
+  `APFP_COMPUTE_UNITS` parameter can be used to replicate the full kernel. Each
+  instantiation runs a fully independent matrix multiplication unit. These
+  units can collaborate on a single matrix multiplication operation (see
+  `host/TestMatrixMultiplication.cpp` for an example).
+- The floating point multiplier uses Karatsuba decomposition to reduce the
+  overall resource usage of the design. The decomposition bottoms out at
+  `APFP_MULT_BASE_BITS`, at which point it falls back on naive multiplication
+  using DSPs as generated by the HLS tool. Similarly, `APFP_ADD_BASE_BITS`
+  configures the number of bits to dispatch to the HLS tool's addition
+  implementation; additions wider than this threshold are manually pipelined
+  into multiple stages.
+- To avoid being memory bound, the matrix multiplication implementation is
+  tiled using the approach described in our [FPGA'20
+  paper](https://spcl.inf.ethz.ch/Publications/.pdf/gemm-fpga.pdf) [2]. The
+  tile sizes are exposed through the `APFP_TILE_SIZE_N` and `APFP_TILE_SIZE_M`
+  parameters. The highest arithmetic intensity is achieved when these two
+  quantities are equal and maximized, but relatively small tile sizes are
+  sufficient to overcome the memory bottleneck (e.g., 32×32). Larger tile sizes
+  increase arithmetic intensity at the cost of BRAM usage and potential
+  overhead when the matrix dimensions are not multiples of the tile size.
+- `APFP_FREQUENCY` can be used to change the maximum frequency targeted by the
+  design. If unspecified, the default of the target platform will be used.
+
+For more details on how to configure the project to achieve high throughput,
+see our paper [1].
+
+## Compilation
+
+Please make sure you clone the repository with `git clone --recursive` or run
+`git submodule update --init` after cloning to check out dependencies.
+
+The minimum commands necessary to configure and build the code are:
+
+```bash
+mkdir build
+cd build
+cmake ..  # Default parameters
+make      # Builds software components
+make hw   # Builds hardware accelerator
+```
+
+However, the accelerator should always be configured to match the target system
+using the parameters described in the previous section and in our paper [1].
+The CMake configuration flow uses
+[hlslib](https://github.com/definelicht/hlslib) [3] to locate the Xilinx tools
+and expose hardware build targets.
+
+Configuring the project requires Vitis, GMP, and MPFR.
+
+## Running the code
+
+We provide example host code that runs the matrix multiplication accelerator
+on a randomized input in `host/TestMatrixMultiplication.cpp`. See the executable
+for usage. An example invocation could be:
+
+```bash
+./TestMatrixMultiplicationHardware hw 256 256 256
+```
+
+## Installation
+
+To install the project, including both the software interface components and the
+hardware accelerator itself (built with `make hw`), simply run `make install`.
+The installation directory is configured with the
+`CMAKE_INSTALL_PREFIX` parameter.
+
+## References
+
+[1] Johannes de Fine Licht, Christopher A. Pattison, Alexandros Nikolaos
+Ziogas, David Simmons-Duffin, and Torsten Hoefler, _"Fast Arbitrary Precision
+Floating Point on FPGA"_, in Proceedings of the 2022 IEEE 30th Annual
+International Symposium on Field-Programmable Custom Computing Machines
+(FCCM'22). [🔗](https://spcl.inf.ethz.ch/Publications/.pdf/apfp.pdf)
+
+[2] Johannes de Fine Licht, Grzegorz Kwasniewski, and Torsten Hoefler,
+_"Flexible Communication Avoiding Matrix Multiplication on FPGA with High-Level
+Synthesis"_, in Proceedings of the 28th ACM/SIGDA International Symposium on
+Field-Programmable Gate Arrays (FPGA'20).
+[🔗](https://spcl.inf.ethz.ch/Publications/.pdf/gemm-fpga.pdf)
+
+[3] Johannes de Fine Licht and Torsten Hoefler, _"hlslib: Software Engineering
+for Hardware Design"_, presented at the Fifth International Workshop on
+Heterogeneous High-performance Reconfigurable Computing (H2RC'19).
+[🔗](https://spcl.inf.ethz.ch/Publications/.pdf/hlslib.pdf)