Overview

Quantization refers to the process of reducing the number of bits used to represent a number. This process casts values from a `float` type to an `integer` type that uses fewer bits.

How Quantization Works

Linear quantization, also known as affine quantization, does this by mapping the range of float values to a quantized range, such as the 8-bit signed integer range [-128, 127], and interpolating linearly.

This mapping is expressed by the following mathematical equations:

```
# Dequantizing weights:
w_unquantized = scale * (w_quantized - zero_point)

# Quantizing weights:
w_quantized = clip(round(w_unquantized / scale) + zero_point)
```

In the above equations, `w_unquantized` and `scale` are of type float, and `w_quantized` and `zero_point` (also called quantization bias, or offset) are of the quantized data type.
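
The two equations can be tried out on a handful of values. The following is a minimal sketch in plain Python, not the Core ML Tools implementation: the helper names (`compute_affine_params`, `quantize`, `dequantize`) are hypothetical, and deriving `scale` and `zero_point` from the observed min and max is one common choice that is assumed here.

```python
def compute_affine_params(w_min, w_max, q_min=-128, q_max=127):
    """Return (scale, zero_point) mapping [w_min, w_max] onto [q_min, q_max].

    Assumption: the range is widened to include 0.0 so that zero is
    exactly representable after quantization.
    """
    w_min = min(w_min, 0.0)
    w_max = max(w_max, 0.0)
    scale = (w_max - w_min) / (q_max - q_min)
    if scale == 0.0:
        scale = 1.0  # all-zero input: any positive scale works
    zero_point = round(q_min - w_min / scale)
    zero_point = max(q_min, min(q_max, zero_point))
    return scale, zero_point


def quantize(w_unquantized, scale, zero_point, q_min=-128, q_max=127):
    """w_quantized = clip(round(w_unquantized / scale) + zero_point)."""
    q = round(w_unquantized / scale) + zero_point
    return max(q_min, min(q_max, q))


def dequantize(w_quantized, scale, zero_point):
    """w_unquantized = scale * (w_quantized - zero_point)."""
    return scale * (w_quantized - zero_point)


weights = [-0.5, 0.0, 0.25, 1.5]
scale, zero_point = compute_affine_params(min(weights), max(weights))
quantized = [quantize(w, scale, zero_point) for w in weights]
restored = [dequantize(q, scale, zero_point) for q in quantized]

for w, q, r in zip(weights, quantized, restored):
    print(f"{w:+.4f} -> {q:4d} -> {r:+.4f}")
```

In this sketch the smallest and largest weights land on the ends of the 8-bit range, and each restored value differs from the original by at most half of `scale`.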

Quantization Precision

For quantizing weights, both 8-bit and 4-bit precision are supported. For quantizing activations, 8-bit precision is supported.
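
To make the difference between the two precisions concrete, the short sketch below computes the integer range available at a given bit width. `int_range` is a hypothetical helper written for illustration and is not part of Core ML Tools.

```python
def int_range(n_bits, signed=True):
    """Return the (min, max) values representable with n_bits."""
    if signed:
        return -(2 ** (n_bits - 1)), 2 ** (n_bits - 1) - 1
    return 0, 2 ** n_bits - 1


for bits in (8, 4):
    lo, hi = int_range(bits)
    print(f"{bits}-bit signed: [{lo}, {hi}], {hi - lo + 1} levels")
```

An 8-bit type offers 256 distinct levels while a 4-bit type offers only 16, which is why 4-bit quantization usually benefits from finer-grained scale factors.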

Symmetric Quantization

Constraining the `zero_point` to be zero is referred to as symmetric quantization. In this case, the quantization and dequantization operations simplify further, because the offset term drops out of both equations. This is the default mode used by Core ML Tools.
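
With `zero_point` fixed at zero, only the scale has to be computed. The sketch below illustrates this in plain Python under two assumptions that are not stated in this document: the scale is taken from the largest absolute weight, and the quantized range is restricted to [-127, 127] so that it is symmetric around zero. The function names are hypothetical.

```python
def symmetric_scale(weights, q_max=127):
    """Scale that maps the largest absolute weight onto q_max."""
    max_abs = max(abs(w) for w in weights)
    return max_abs / q_max if max_abs > 0 else 1.0


def quantize_symmetric(w, scale, q_max=127):
    """w_quantized = clip(round(w / scale)); the zero_point term is gone."""
    return max(-q_max, min(q_max, round(w / scale)))


def dequantize_symmetric(q, scale):
    """w_unquantized = scale * w_quantized."""
    return scale * q


weights = [-0.8, 0.0, 0.4, 2.0]
scale = symmetric_scale(weights)
quantized = [quantize_symmetric(w, scale) for w in weights]
restored = [dequantize_symmetric(q, scale) for q in quantized]
print(quantized)
```

Because the range is centered on zero, a float value of 0.0 always maps to the integer 0 and back without error.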

Quantization Granularity

There are three modes supported for `QuantizationGranularity`: `per_tensor`, `per_channel`, and `per_block`. `per_tensor` granularity computes a single float scale value (and zero point, in the case of asymmetric quantization) for the whole tensor. `per_channel` granularity uses a scale factor for each outer dimension (also referred to as the `output channel`) of the weight tensor. `per_block` granularity shares scale factors across blocks of values in the weight tensor. This gives finer-grained control over how the weight values are quantized, which helps improve the accuracy of the model.
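
The three granularities differ only in which group of values shares a scale. The sketch below shows this for a small 2-D weight (a list of rows, one row per output channel). It is an illustration under assumptions, not the Core ML Tools implementation: it uses symmetric max-abs scales, it forms blocks along the input dimension within each output channel, and all function names are hypothetical.

```python
def max_abs_scale(values, q_max=127):
    """Symmetric scale for one group of values."""
    max_abs = max(abs(v) for v in values)
    return max_abs / q_max if max_abs > 0 else 1.0


def per_tensor_scale(weight):
    """One scale shared by every value in the tensor."""
    return max_abs_scale([v for row in weight for v in row])


def per_channel_scales(weight):
    """One scale per output channel (per row)."""
    return [max_abs_scale(row) for row in weight]


def per_block_scales(weight, block_size):
    """One scale per block of block_size values within each row."""
    return [
        [max_abs_scale(row[i:i + block_size])
         for i in range(0, len(row), block_size)]
        for row in weight
    ]


weight = [
    [0.01, -0.02, 1.27, -0.50],  # channel 0 contains a large outlier
    [0.10, 0.05, -0.03, 0.02],   # channel 1 has only small values
]

tensor_scale = per_tensor_scale(weight)
channel_scales = per_channel_scales(weight)
block_scales = per_block_scales(weight, block_size=2)

print("per_tensor: ", tensor_scale)
print("per_channel:", channel_scales)
print("per_block:  ", block_scales)
```

In this example the outlier 1.27 sets the single `per_tensor` scale, so the small values in channel 1 are quantized coarsely. With `per_channel` or `per_block` scales, groups that do not contain the outlier get a smaller scale and therefore a finer resolution.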

Activation Quantization

The Pruning and Palettization compression schemes compress only weights. With 8-bit quantization, the activations of the network can also be quantized, using their own scale factors.

Activations are quantized using `per-tensor` mode. During training, or while calibration data is passed through the model, the values of intermediate activations are observed and their max and min values are used to compute the quantization scales, which are stored with the model and used during inference. Quantizing the intermediate tensors may help in inference of networks that are bottlenecked by memory bandwidth due to large activations.
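
The calibration step described above can be sketched as a small observer that tracks the running min and max of one activation tensor across batches and then derives per-tensor parameters from them. This is an illustrative sketch only: `MinMaxObserver` is a hypothetical class, the calibration values are made up, and the use of an affine (non-zero `zero_point`) mapping for activations is an assumption.

```python
class MinMaxObserver:
    """Tracks the min and max of an activation across calibration batches."""

    def __init__(self):
        self.min_val = float("inf")
        self.max_val = float("-inf")

    def observe(self, activations):
        """Update the running range with one batch of activation values."""
        self.min_val = min(self.min_val, min(activations))
        self.max_val = max(self.max_val, max(activations))

    def compute_qparams(self, q_min=-128, q_max=127):
        """Derive a per-tensor (scale, zero_point) from the observed range."""
        lo = min(self.min_val, 0.0)  # keep 0.0 exactly representable
        hi = max(self.max_val, 0.0)
        scale = (hi - lo) / (q_max - q_min)
        if scale == 0.0:
            scale = 1.0
        zero_point = round(q_min - lo / scale)
        zero_point = max(q_min, min(q_max, zero_point))
        return scale, zero_point


# Made-up calibration batches, e.g. outputs of a ReLU layer.
calibration_batches = [
    [0.0, 1.2, 3.4],
    [0.5, 5.1, 2.0],
    [0.1, 4.0, 0.3],
]

observer = MinMaxObserver()
for batch in calibration_batches:
    observer.observe(batch)

scale, zero_point = observer.compute_qparams()
print(scale, zero_point)
```

The resulting `scale` and `zero_point` are fixed after calibration, so no range statistics need to be computed at inference time.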

On newer hardware, such as the iPhone 15 Pro (A17 Pro), quantizing both activations and weights to `int8` can leverage optimized compute on the Neural Engine. This can help improve runtime latency in compute-bound models.