Quantization Algorithms#

The following algorithms are available in Core ML Tools for quantizing a model:

  • Post-training (data-free) weight quantization

  • Post-training (data calibration) activation quantization

  • GPTQ algorithm for weight quantization (post-training data calibration)

  • Fine-tuning based algorithm for quantizing weights and/or activations

Post-training (data-free) weight quantization#

This algorithm uses the round-to-nearest (RTN) method to quantize the model weights. Since it requires no data, it is the fastest approach to weight quantization.

Suggested API(s):
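For an already-converted Core ML model, coremltools.optimize.coreml.linear_quantize_weights performs this data-free quantization. A minimal sketch is shown below; the model path is a placeholder, and the granularity argument assumes a recent coremltools release:

```python
import coremltools as ct
from coremltools.optimize.coreml import (
    OpLinearQuantizerConfig,
    OptimizationConfig,
    linear_quantize_weights,
)

# Load an already-converted model; the path is a placeholder.
mlmodel = ct.models.MLModel("model.mlpackage")

# Data-free RTN quantization: 8-bit symmetric weights, one scale per output channel.
op_config = OpLinearQuantizerConfig(
    mode="linear_symmetric",
    dtype="int8",
    granularity="per_channel",
)
config = OptimizationConfig(global_config=op_config)

compressed_mlmodel = linear_quantize_weights(mlmodel, config)
compressed_mlmodel.save("model_w8.mlpackage")
```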

Post-training (data calibration) activation quantization#

This algorithm quantizes the activations using a calibration dataset. Calibration data is passed through the model to estimate the range of values that the activations take. This estimate is then used to compute the scale and zero point, via the RTN method, for quantizing the activations.

Suggested API(s):
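One way to run this workflow from PyTorch is coremltools.optimize.torch.quantization.LinearQuantizer used in calibration-only mode: prepare the model, run forward passes over calibration data without any gradient updates, and finalize. The sketch below uses a torchvision model and random tensors as stand-ins for a real model and calibration set, and relies on the default observer / fake-quantize scheduling:

```python
import torch
import torchvision
import coremltools as ct
from coremltools.optimize.torch.quantization import LinearQuantizer, LinearQuantizerConfig

# Stand-ins for a real model and a real calibration set.
model = torchvision.models.mobilenet_v2()
calibration_data = [torch.rand(1, 3, 224, 224) for _ in range(128)]
example_input = torch.rand(1, 3, 224, 224)

config = LinearQuantizerConfig.from_dict(
    {"global_config": {"quantization_scheme": "symmetric"}}
)
quantizer = LinearQuantizer(model, config)

# Insert observers / fake-quantize layers around weights and activations.
prepared_model = quantizer.prepare(example_inputs=(example_input,))

# Calibration only: forward passes, no gradient updates.
quantizer.step()
with torch.no_grad():
    for x in calibration_data:
        prepared_model(x)

# Fold the observed ranges into scales / zero-points, then convert.
quantized_model = quantizer.finalize()
quantized_model.eval()
traced = torch.jit.trace(quantized_model, example_input)
mlmodel = ct.convert(
    traced,
    inputs=[ct.TensorType(shape=example_input.shape)],
    minimum_deployment_target=ct.target.iOS17,
)
```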

GPTQ algorithm for weight quantization (post-training data calibration)#

This algorithm is based on the paper GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers. It follows a layerwise compression paradigm: a sequential model is compressed layer by layer while minimizing the error introduced by quantizing the weights. Each layer is compressed by minimizing the L2 norm of the difference between the layer’s original outputs and the outputs obtained with the compressed weights. The outputs are computed on a few samples of calibration data (around 128 samples are usually sufficient). Once a layer is compressed, its outputs are used as inputs for compressing the next layer.

Suggested API(s):
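In recent coremltools releases this workflow is exposed through coremltools.optimize.torch.layerwise_compression.LayerwiseCompressor configured with the GPTQ algorithm. The sketch below is illustrative only: the model and calibration tensors are placeholders, the config fields may differ across versions, and how layers are discovered for compression depends on the model architecture:

```python
import torch
import torchvision
from coremltools.optimize.torch.layerwise_compression import (
    LayerwiseCompressor,
    LayerwiseCompressorConfig,
)

# Placeholder model and calibration samples; substitute your own.
model = torchvision.models.mobilenet_v2().eval()
calibration_data = [torch.rand(1, 3, 224, 224) for _ in range(128)]

# GPTQ: compress weights to 4 bits, layer by layer, using the calibration
# samples to minimize each layer's output reconstruction error.
config = LayerwiseCompressorConfig.from_dict(
    {
        "global_config": {
            "algorithm": "gptq",
            "weight_dtype": "uint4",
        },
        "calibration_nsamples": 128,
    }
)

compressor = LayerwiseCompressor(model, config)
compressed_model = compressor.compress(calibration_data, device="cpu")
```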

Fine-tuning based algorithm for quantizing weights and/or activations#

This algorithm is also known as quantization-aware training (QAT), as described in the paper Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference. QAT allows both weights and activations to be quantized. Quantization is simulated on the weights and/or activations during training, and the model is fine-tuned to recover the accuracy lost to quantization.

Suggested API(s):
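coremltools.optimize.torch.quantization.LinearQuantizer covers this workflow as well, this time with an actual fine-tuning loop. In the sketch below, the model, data, and hyperparameters are placeholders for a real training setup:

```python
import torch
import torchvision
import coremltools as ct
from coremltools.optimize.torch.quantization import LinearQuantizer, LinearQuantizerConfig

# Placeholder model, data, and hyperparameters; substitute a real training setup.
model = torchvision.models.mobilenet_v2()
train_data = [
    (torch.rand(8, 3, 224, 224), torch.randint(0, 1000, (8,))) for _ in range(10)
]
example_input = torch.rand(1, 3, 224, 224)

config = LinearQuantizerConfig.from_dict(
    {"global_config": {"quantization_scheme": "symmetric"}}
)
quantizer = LinearQuantizer(model, config)
prepared_model = quantizer.prepare(example_inputs=(example_input,))

optimizer = torch.optim.SGD(prepared_model.parameters(), lr=1e-4)
loss_fn = torch.nn.CrossEntropyLoss()

# Fine-tune with quantization simulated in the forward pass;
# quantizer.step() advances the observer / fake-quantize schedule each batch.
prepared_model.train()
for inputs, labels in train_data:
    optimizer.zero_grad()
    loss = loss_fn(prepared_model(inputs), labels)
    loss.backward()
    optimizer.step()
    quantizer.step()

# Replace the simulated quantization layers with actual quantization and convert.
quantized_model = quantizer.finalize()
quantized_model.eval()
traced = torch.jit.trace(quantized_model, example_input)
mlmodel = ct.convert(
    traced,
    inputs=[ct.TensorType(shape=example_input.shape)],
    minimum_deployment_target=ct.target.iOS17,
)
```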

PyTorch quantization APIs

You can use PyTorch’s quantization APIs directly and then convert the model to Core ML. However, the performance of the converted model may not be optimal: the PyTorch API default settings (for example, symmetric vs. asymmetric quantization modes, and which ops are quantized) are not optimal for the Core ML stack and Apple hardware. If you use the Core ML Tools coremltools.optimize.torch APIs, as described in this section, the correct default settings are applied automatically.

Impact on accuracy with different modes#

Weight-only post-training quantization (PTQ) using 8-bit precision with per-channel granularity typically preserves model accuracy well. Compressing further to 4-bit precision generally requires per-block granularity to retain accuracy, as sketched below.
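As an illustration, a 4-bit, per-block weight configuration could be expressed with the data-free API roughly as follows. The dtype, granularity, and block_size arguments assume a coremltools release that supports per-block quantization, and the block size of 32 is only an example:

```python
import coremltools as ct
from coremltools.optimize.coreml import (
    OpLinearQuantizerConfig,
    OptimizationConfig,
    linear_quantize_weights,
)

mlmodel = ct.models.MLModel("model.mlpackage")  # placeholder path

# 4-bit weights with one scale per block of 32 elements along the channel axis.
# Smaller blocks tend to preserve accuracy better at the cost of more scale metadata.
op_config = OpLinearQuantizerConfig(
    mode="linear_symmetric",
    dtype="int4",
    granularity="per_block",
    block_size=32,
)
config = OptimizationConfig(global_config=op_config)
compressed_mlmodel = linear_quantize_weights(mlmodel, config)
```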

If these weight-only PTQ approaches do not preserve accuracy well, calibration-data-based techniques such as GPTQ or fine-tuning-based methods such as QAT can be explored.

For activation quantization, the calibration-data-based approach should work well in most cases. However, if accuracy drops, quantizing activations with QAT can be used to recover the lost accuracy.

Accuracy data#

| Model Name | Config | Optimization Workflow | Compression Ratio | Accuracy (%) |
| --- | --- | --- | --- | --- |
| MobileNetv2-1.0 | Float16 | n/a | 1.0 | 71.86 |
| MobileNetv2-1.0 | Weight-only | PTQ | 1.92 | 71.78 |
| MobileNetv2-1.0 | Weight & activation | QAT | 1.92 | 71.66 ± 0.04 |
| ResNet50 | Float16 | n/a | 1.0 | 76.14 |
| ResNet50 | Weight-only | PTQ | 1.99 | 76.10 |
| ResNet50 | Weight & activation | QAT | 1.98 | 76.80 ± 0.05 |
| MobileViTv2-1.0 | Float16 | n/a | 1.0 | 78.09 |
| MobileViTv2-1.0 | Weight-only | PTQ | 1.92 | 77.66 |
| MobileViTv2-1.0 | Weight & activation | QAT | 1.89 | 76.89 ± 0.07 |