Add documentation on quantization support in rten
 - Add a quantization guide in `docs/quantization.md`
 - Expand the quantization section in the rten crate front matter
 - Add a mention of quantization in the performance guide
robertknight committed Feb 8, 2025
1 parent 62c276b commit 16bcc29
Showing 3 changed files with 266 additions and 0 deletions.
3 changes: 3 additions & 0 deletions docs/performance.md
@@ -129,6 +129,9 @@ parallelism or other factors.

Some ways to speed up inference without changing RTen's code are:

- Quantize the model to int8. This reduces memory bandwidth usage (especially
important for larger models) and allows the use of hardware instructions
that accelerate int8 dot products and matrix multiplication.
- If choosing from a family of models with different sizes, you can trade
accuracy for performance by using a smaller model.
- If you can break your problem up into chunks, use
258 changes: 258 additions & 0 deletions docs/quantization.md
@@ -0,0 +1,258 @@
# Quantization support

RTen supports ONNX models that have been quantized to int8 format. Quantization
reduces the file size of the model and can improve inference performance,
depending on the model size and hardware.

The RTen repository contains tools ([ort-quantize.py][ort-quantize]) to assist
with creating quantized models.

This guide explains:

- How quantization works and how it affects performance
- How to quantize ONNX models for use with RTen
- Nuances of quantization support in RTen

To get started with creating and running quantized models using recommended
settings, jump to the "[Quantizing a model](#quantizing-a-model)" section below.

## How quantization works

Quantization means converting a model's weights, and optionally internal
computations, to a smaller data type in order to reduce model size and improve
performance. To do this, float values are mapped to int8 values together with
an associated scale and zero point, such that:

```
float_value = (int8_value - zero_point) * scale
```

Here the zero point is an int8 value and the scale is a float. The zero point
and scale are shared across many tensor elements: there can be one zero point
and scale per row or column, per channel (for an image), or a single pair for
the whole tensor. The zero point can be fixed at zero (_symmetric_
quantization) or allowed to be non-zero (_asymmetric_).
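
As a concrete illustration, here is a minimal NumPy sketch of asymmetric
per-tensor int8 quantization following the mapping above. It is not RTen or
ONNX Runtime code; the helper names and the per-tensor granularity are chosen
just for the example.

```python
import numpy as np

def quantize(x: np.ndarray) -> tuple[np.ndarray, float, int]:
    """Map a float tensor to int8 plus a per-tensor scale and zero point."""
    lo, hi = min(float(x.min()), 0.0), max(float(x.max()), 0.0)  # range must include 0
    scale = (hi - lo) / 255.0
    zero_point = int(np.clip(round(-128 - lo / scale), -128, 127))
    x_q = np.clip(np.round(x / scale) + zero_point, -128, 127).astype(np.int8)
    return x_q, scale, zero_point

def dequantize(x_q: np.ndarray, scale: float, zero_point: int) -> np.ndarray:
    # float_value = (int8_value - zero_point) * scale
    return (x_q.astype(np.int32) - zero_point) * scale

x = np.array([-1.5, -0.1, 0.0, 0.25, 2.0], dtype=np.float32)
x_q, scale, zp = quantize(x)
print(x_q, scale, zp)
print(dequantize(x_q, scale, zp))  # close to x, within one quantization step
```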

Quantization can be applied to the weights only, or to both the weights and
the results of internal computations (the _activations_).

When activations are quantized, the zero point and scale can be computed during
inference, known as _dynamic_ quantization, or offline beforehand, known as
_static_ quantization. When using static quantization, example inputs must
be provided as calibration data. Dynamic quantization is simpler to use and
more accurate, but adds some overhead during inference.

## How quantization affects performance

Quantization can improve performance in two ways:

- By reducing the memory bandwidth required to move weights and activations
from memory into CPU cores for computation.
- By enabling the use of specific hardware instructions for accelerating int8
matrix products and matrix-vector products.

There are, however, additional computation steps involved in quantized model
inference, which reduce the gain compared to the theoretical maximum. These
include converting tensors between int8 and float types and, when using
dynamic quantization, calculating the quantization parameters at runtime.

The impact of each of these depends on whether performance is bottlenecked
primarily by compute or by memory bandwidth. For small models running on
hardware without dot product instructions, the benefit over f32 may be minimal.
For LLMs with billions of parameters, memory bandwidth is the dominant factor
affecting performance and quantization has a large impact.

## Quantization support in RTen

### Supported CPU instructions

All CPUs can run int8-quantized models; however, performance is significantly
improved if int8 dot product instructions are available.

#### Arm

The Arm [dot product
extensions](https://community.arm.com/arm-community-blogs/b/tools-software-ides-blog/posts/exploring-the-arm-dot-product-instructions)
(aka. SDOT / UDOT) are available in all CPUs that support Arm v8.4 and some
earlier Arm v8.2+ CPUs.

#### x64

The dot product instructions on x86_64 are known as
[VNNI](https://www.intel.com/content/www/us/en/developer/articles/guide/deep-learning-with-avx512-and-dl-boost.html)
or "DL Boost". VNNI comes in several flavors:
[AVX512-VNNI](https://en.wikipedia.org/wiki/Advanced_Vector_Extensions#AVX-512_CPU_compatibility_table)
and
[AVX-VNNI](https://en.wikipedia.org/wiki/Advanced_Vector_Extensions#AVX-VNNI,_AVX-IFMA).

RTen currently supports the AVX512-VNNI variant. Future updates will add
AVX-VNNI support. Enabling VNNI requires compiling rten with nightly Rust and
the `avx512` feature enabled.

#### WebAssembly

The CPU's dot product instructions are exposed in WebAssembly if the
`relaxed-simd` target feature is enabled, but RTen does not currently take
advantage of this.

### Supported data types

ONNX quantization allows weights and activations to use 8-bit types of
different signedness (`uint8` vs `int8`).

RTen is currently optimized for the case where activations are uint8 and weights
are int8. This is the default choice used by ONNX's dynamic quantization tool
and [ort-quantize.py][ort-quantize]. Other combinations are supported, but may
run more slowly because RTen converts them internally to the preferred format.

### Supported quantization granularity

The granularity of quantization (i.e. which elements are quantized together
and share a scale and zero point) can be per-tensor, per-channel or per-block.
RTen currently supports per-tensor and per-channel quantization. There is no
performance advantage to using per-tensor quantization over per-channel.
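
The difference in granularity can be pictured with a small NumPy sketch
(illustrative only, using symmetric int8 quantization for brevity): per-tensor
quantization uses one scale for the whole weight matrix, while per-channel
quantization uses one scale per output column, which helps when channels have
very different value ranges.

```python
import numpy as np

rng = np.random.default_rng(0)
# A weight matrix whose three output channels (columns) have very different magnitudes.
w = rng.normal(scale=[0.01, 1.0, 5.0], size=(4, 3)).astype(np.float32)

# Per-tensor: a single scale shared by every element.
scale_tensor = np.abs(w).max() / 127.0
w_q_tensor = np.clip(np.round(w / scale_tensor), -127, 127).astype(np.int8)

# Per-channel: one scale per column, so the small-valued channel is not
# crushed by the channel with the largest magnitude.
scale_channel = np.abs(w).max(axis=0) / 127.0
w_q_channel = np.clip(np.round(w / scale_channel), -127, 127).astype(np.int8)

# Reconstruction error is typically much lower with per-channel scales.
print(np.abs(w_q_tensor * scale_tensor - w).mean())
print(np.abs(w_q_channel * scale_channel - w).mean())
```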

### Supported quantization symmetry

RTen always assumes asymmetric quantization internally (i.e. that the zero
point may be non-zero). Hence there is no performance advantage to using
symmetric quantization, as there may be in some other runtimes.

### Saturation hazard on x86_64 CPUs

On x64 systems which do not support VNNI / DL Boost, int8 matrix multiplication
uses a CPU instruction (`VPMADDUBSW`) which can encounter saturation when adding
pairs of int16 values. The workaround for this issue in ONNX is to ensure that
quantized weights are actually 7-bit integers ([-64, 63] for int8, [0, 127]
for uint8) by enabling the "range reduction" setting in the quantization tool.
The
[ort-quantize.py][ort-quantize] script in this repository will do this
automatically.

See [Intel's
documentation](https://oneapi-src.github.io/oneDNN/dev_guide_int8_computations.html)
for more information about the issue.
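
The following NumPy sketch models the pairwise uint8 × int8 → int16
accumulation performed by `VPMADDUBSW`, to show where the saturation comes
from and why range reduction avoids it. It is a model of the instruction's
arithmetic, not of RTen's actual kernel.

```python
import numpy as np

def pairwise_maddubs(a_u8: np.ndarray, b_i8: np.ndarray) -> np.ndarray:
    """Model VPMADDUBSW: multiply uint8 x int8 elementwise, then add adjacent
    pairs of products with saturating int16 arithmetic."""
    prod = a_u8.astype(np.int32) * b_i8.astype(np.int32)
    pair_sums = prod[0::2] + prod[1::2]
    return np.clip(pair_sums, -32768, 32767).astype(np.int16)  # int16 saturation

acts = np.array([255, 255], dtype=np.uint8)       # worst-case uint8 activations

w_full = np.array([127, 127], dtype=np.int8)      # full-range int8 weights
print(pairwise_maddubs(acts, w_full))     # 255*127 + 255*127 = 64770 -> saturates to 32767

w_reduced = np.array([63, 63], dtype=np.int8)     # "range reduced" 7-bit weights
print(pairwise_maddubs(acts, w_reduced))  # 255*63 + 255*63 = 32130, no saturation
```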

### Supported quantization operators

ONNX has different operators that can be used to represent quantized operations.
RTen supports the "Tensor-oriented" operators (`QuantizeLinear`,
`DequantizeLinear`, `DynamicQuantizeLinear`) as well as integer matrix
multiplication and convolution (`MatMulInteger`, `ConvInteger`). It does not
currently support "QOperator" operators (`QLinearMatMul` and `QLinearConv`).

RTen only supports quantization operators that are part of the ONNX standard
(see [operator list](https://onnx.ai/onnx/operators/)). It does not support
custom operators which are specific to particular runtimes. You may encounter
this when trying to use a quantized model published on the internet to which
"model optimizations" have been applied, as these optimizations may include the
use of runtime-specific operators. If you encounter a problem trying to convert
an existing quantized ONNX model to RTen's format, you can try downloading the
un-quantized model and converting it using the [ort-quantize.py][ort-quantize]
script.

### Weights-only quantization

RTen is currently optimized for running models where both weights and
activations of matrix multiplication and convolutions are quantized using
`MatMulInteger` and `ConvInteger`. Models using weights-only quantization will
run, but with sub-optimal performance, because RTen does not yet fuse
dequantization (`DequantizeLinear`) with the compute operation (`MatMul`,
`Conv`, etc.).

## Using an existing quantized model

You can use quantized models downloaded from the internet, provided they only
use standard ONNX operators (see section on supported operators above). If
multiple quantized variants are available, the preferred choice is the one that
uses dynamic quantization with uint8 activations, int8 weights and range
reduction enabled.

If it is unclear which settings were used when quantizing a model, you can
download the model and inspect it using
[Netron](https://github.com/lutzroeder/netron). Search for `MatMulInteger` or
`ConvInteger` operators in the model and see what data type the weights (second
input) have. You can check whether dynamic or static quantization is used by
searching for `DynamicQuantizeLinear` operators, which indicate dynamic
quantization. To determine whether range reduction was used, check the range
of the weight values and see if they lie in [0, 127] for uint8 or [-64, 63]
for int8.
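
If you prefer to check these properties programmatically rather than in
Netron, a short script using the `onnx` Python package can report them. This
is a sketch based on the standard ONNX protobuf schema; it assumes the
quantized weights are stored as graph initializers.

```python
import onnx
from onnx import TensorProto, numpy_helper

model = onnx.load("model.quant.onnx")
graph = model.graph
initializers = {t.name: t for t in graph.initializer}

# DynamicQuantizeLinear nodes indicate dynamic quantization.
uses_dynamic = any(n.op_type == "DynamicQuantizeLinear" for n in graph.node)
print("dynamic quantization:", uses_dynamic)

# Report the data type and value range of the weights (second input) of each
# integer matmul/convolution, to check signedness and range reduction.
for node in graph.node:
    if node.op_type in ("MatMulInteger", "ConvInteger"):
        weights = initializers.get(node.input[1])
        if weights is None:
            continue  # weights produced by another node rather than stored directly
        values = numpy_helper.to_array(weights)
        dtype = TensorProto.DataType.Name(weights.data_type)
        print(node.op_type, "weights:", dtype, "range:", values.min(), "..", values.max())
```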

## Quantizing a model

The easiest way to quantize an fp32 or fp16 model is to use the
[ort-quantize.py][ort-quantize] script in the rten repository. This
will produce an ONNX model which is compatible with both RTen and other ONNX
runtimes.

```
pip install onnx onnxruntime
python tools/ort-quantize.py model.onnx
```

This command will produce a `model.quant.onnx` file, which can then be converted
to `.rten` format for use with RTen using:

```
pip install rten-convert
rten-convert model.quant.onnx
```

This will produce `model.quant.rten`, which you can load and run in RTen
in the same way as an fp32 model.

This script uses quantization settings that prioritize making the quantization
process simple and producing models which work across a range of hardware.
It may be possible to improve performance and accuracy slightly by using the
underlying ONNX quantization tools with custom settings: for example, using
static rather than dynamic quantization to reduce overhead during inference,
or disabling range reduction if you are targeting hardware not affected by
the x64 saturation hazard.
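
For example, here is a sketch of calling the ONNX Runtime dynamic quantization
API directly with custom settings. The available parameters vary between
onnxruntime versions, so treat the values below as illustrative rather than a
recommended configuration.

```python
from onnxruntime.quantization import QuantType, quantize_dynamic

quantize_dynamic(
    model_input="model.onnx",
    model_output="model.quant.onnx",
    weight_type=QuantType.QInt8,  # int8 weights; activations are quantized to uint8
    per_channel=True,             # per-channel rather than per-tensor weight scales
    reduce_range=False,           # disable only if targets are unaffected by the saturation hazard
)
```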

### Quantizing convolution operators

`ort-quantize.py` does not quantize `Conv` operators by default in order to
produce models which work in ONNX Runtime as well as RTen (see [ONNX Runtime
issue](https://github.com/microsoft/onnxruntime/issues/15888)).

If you only intend to use the model with `rten`, you can pass the
`--quantize-conv` flag which will enable the use of the quantized `ConvInteger`
operator. This can reduce the model size and improve inference performance
of convolution operations.

## Understanding the structure of quantized ONNX models

This section explains how quantization is represented as operators in ONNX
models, which is useful to understand when inspecting a quantized model using a
tool such as [Netron](https://netron.app).

Weights-only quantization is expressed in ONNX by using graphs with a structure
like:

```
MatMul(activations, DequantizeLinear(weights))
```

When weights and activations are both quantized using dynamic quantization, this
produces graphs such as:

```
X_quant, X_scale, X_zero_point = DynamicQuantizeLinear(X)
Y_quant = MatMulInteger(X_quant, W_quant, X_zero_point, W_zero_point)
Y = Cast(Y_quant, fp32)
Y_scaled = Mul(Y, X_scale)
```
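
The same computation can be written out in NumPy to show what each of these
operators contributes. This is a sketch of the math only (uint8 activations,
symmetric int8 weights); for clarity, both the activation and weight scales
are applied explicitly in the final multiplication.

```python
import numpy as np

def dynamic_quantize(x: np.ndarray):
    """Per-tensor uint8 quantization, as performed by DynamicQuantizeLinear."""
    lo, hi = min(float(x.min()), 0.0), max(float(x.max()), 0.0)
    scale = (hi - lo) / 255.0
    zero_point = int(np.clip(round(-lo / scale), 0, 255))
    x_q = np.clip(np.round(x / scale) + zero_point, 0, 255).astype(np.uint8)
    return x_q, scale, zero_point

rng = np.random.default_rng(0)
X = rng.normal(size=(2, 4)).astype(np.float32)
W = rng.normal(size=(4, 3)).astype(np.float32)

# Weights are quantized ahead of time (symmetric int8, zero point of 0).
w_scale = float(np.abs(W).max()) / 127.0
W_q = np.round(W / w_scale).astype(np.int8)

# DynamicQuantizeLinear: quantize the activations at runtime.
X_q, x_scale, x_zp = dynamic_quantize(X)

# MatMulInteger: subtract zero points and accumulate in int32.
Y_q = (X_q.astype(np.int32) - x_zp) @ W_q.astype(np.int32)

# Cast + Mul: convert back to float and apply the scales.
Y = Y_q.astype(np.float32) * x_scale * w_scale

print(np.abs(Y - X @ W).max())  # small quantization error
```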

To achieve optimal performance, the runtime may "fuse" several steps together.
RTen currently has very limited fusion for quantization operators, so
depending on the model these extra steps can have a varying cost. Better
support for fusing quantization operators is planned for the future.

## Further reading

- [ONNX quantization
guide](https://onnxruntime.ai/docs/performance/model-optimizations/quantization.html)
- [oneDNN - Nuances of int8 computation on CPU and
GPU](https://oneapi-src.github.io/oneDNN/dev_guide_int8_computations.html#processors-with-the-intel-avx2-or-intel-avx-512-support)

[ort-quantize]: ../tools/ort-quantize.py
5 changes: 5 additions & 0 deletions src/lib.rs
@@ -87,6 +87,11 @@
//! The `tools/ort-quantize.py` script in the RTen repository can be used to
//! quantize an existing model with float tensors into this format.
//!
//! See the [quantization
//! guide](https://github.com/robertknight/rten/blob/main/docs/quantization.md)
//! for a tutorial on how to quantize models and more information about
//! quantization in ONNX and the nuances of quantization support in RTen.
//!
//! # Inspecting models
//!
//! The [rten-cli](https://crates.io/crates/rten-cli) tool can be used to query
