Quantization Algorithms#

The following algorithms are available in Core ML Tools for quantizing a model:

  • Post Training (data free) weight quantization

  • Post Training (data calibration) activation quantization

  • GPTQ algorithm for weight quantization (post training data calibration)

  • Fine tuning based algorithm for quantizing weights and/or activations

Post Training (data free) weight quantization#

This algorithm uses the round-to-nearest (RTN) method to quantize the model weights. Since no data is required, it is the fastest approach for quantizing a model's weights.

Suggested API(s):
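For example, here is a minimal sketch using the data-free `linear_quantize_weights` API from `coremltools.optimize.coreml`; the model path and configuration values are illustrative assumptions, not recommendations:

```python
import coremltools as ct
import coremltools.optimize.coreml as cto_coreml

# Load a previously converted Core ML model (the path is an illustrative assumption).
mlmodel = ct.models.MLModel("model.mlpackage")

# Configure 8-bit RTN weight quantization, applied globally to all supported ops.
op_config = cto_coreml.OpLinearQuantizerConfig(
    mode="linear_symmetric",
    dtype="int8",
)
config = cto_coreml.OptimizationConfig(global_config=op_config)

# Quantize the weights; no calibration data is needed.
compressed_mlmodel = cto_coreml.linear_quantize_weights(mlmodel, config=config)
compressed_mlmodel.save("model_quantized.mlpackage")
```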

Post Training (data calibration) activation quantization#

This algorithm quantizes the activations using a calibration dataset. The calibration data is passed through the model to estimate the range of values that each activation takes. This estimate is then used to compute the scale and zero-point for quantizing the activations with the RTN method.

Suggested API(s):
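As an illustration, here is a rough sketch of a calibration loop using `LinearQuantizer` from `coremltools.optimize.torch.quantization`; the model, data loader, input shape, and milestone values are assumptions for the example:

```python
import torch
from coremltools.optimize.torch.quantization import (
    LinearQuantizer,
    LinearQuantizerConfig,
    ModuleLinearQuantizerConfig,
)

# `model` (a torch.nn.Module) and `calibration_loader` are assumed to exist.
config = LinearQuantizerConfig(
    global_config=ModuleLinearQuantizerConfig(
        quantization_scheme="symmetric",
        # Illustrative schedule: enable observers and quantization simulation immediately.
        milestones=[0, 0, 0, 0],
    )
)
quantizer = LinearQuantizer(model, config)

# Insert observers / fake-quantization modules into the model.
example_input = torch.rand(1, 3, 224, 224)
prepared_model = quantizer.prepare(example_inputs=(example_input,), inplace=False)
quantizer.step()

# Run calibration data through the model to estimate activation ranges.
prepared_model.eval()
with torch.no_grad():
    for data in calibration_loader:
        prepared_model(data)

# Produce the model with quantized weights and activation scales / zero-points.
quantized_model = quantizer.finalize()
```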

GPTQ algorithm for weight quantization (post training data calibration)#

This algorithm is based on the paper GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers. It follows a layerwise compression paradigm: a sequential model is compressed layer by layer, minimizing the quantization error introduced in the weights. Each layer is compressed by minimizing the L2 norm of the difference between the layer’s original outputs and the outputs obtained with the compressed weights. The outputs are computed using a few samples of training data (around 128 samples are usually sufficient). Once a layer is compressed, its outputs are used as inputs for compressing the next layer.

Suggested API(s):
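As a sketch, the layerwise compression workflow in `coremltools.optimize.torch.layerwise_compression` can be configured to run GPTQ; the model, calibration loader, weight dtype, and sample count below are illustrative assumptions:

```python
from coremltools.optimize.torch.layerwise_compression import (
    LayerwiseCompressor,
    LayerwiseCompressorConfig,
)

# `model` (a torch.nn.Module) and `calibration_loader` (an iterable yielding
# input tensors) are assumed to exist for this example.
config = LayerwiseCompressorConfig.from_dict(
    {
        "global_config": {
            "algorithm": "gptq",
            "weight_dtype": "uint4",
        },
        "input_cacher": "default",
        "calibration_nsamples": 128,
    }
)

compressor = LayerwiseCompressor(model, config)

# Layers are compressed sequentially: each layer's weights are quantized by
# minimizing the L2 error of its outputs on the calibration samples, and the
# compressed outputs are then fed to the next layer.
compressed_model = compressor.compress(calibration_loader, device="cpu")
```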

Fine tuning based algorithm for quantizing weights and/or activations#

This algorithm is also known as quantization-aware training (QAT), as described in the paper Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference. QAT allows both weights and activations to be quantized. Quantization of the weights and/or activations is simulated while the model is fine tuned, which recovers the accuracy lost by quantizing the model.

Suggested API(s):
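A condensed QAT sketch with `LinearQuantizer` follows; the training loop, loss function, and hyperparameters are placeholders:

```python
import torch
from coremltools.optimize.torch.quantization import (
    LinearQuantizer,
    LinearQuantizerConfig,
    ModuleLinearQuantizerConfig,
)

# `model`, `train_loader`, and `loss_fn` are assumed to exist for this example.
config = LinearQuantizerConfig(
    global_config=ModuleLinearQuantizerConfig(quantization_scheme="symmetric")
)
quantizer = LinearQuantizer(model, config)

# Insert fake-quantization layers that simulate quantization of weights
# and activations during fine tuning.
example_input = torch.rand(1, 3, 224, 224)
prepared_model = quantizer.prepare(example_inputs=(example_input,), inplace=False)

optimizer = torch.optim.SGD(prepared_model.parameters(), lr=1e-4)
prepared_model.train()
for inputs, labels in train_loader:
    optimizer.zero_grad()
    loss = loss_fn(prepared_model(inputs), labels)
    loss.backward()
    optimizer.step()
    quantizer.step()  # advance the quantizer's internal milestone schedule

# Convert the fine-tuned model into one with quantized parameters.
quantized_model = quantizer.finalize()
```

The finalized model can then be converted to Core ML with `coremltools.convert`.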

PyTorch quantization APIs

You can use PyTorch’s quantization APIs directly and then convert the model to Core ML. However, the converted model’s performance may not be optimal: the PyTorch API default settings (such as the choice between symmetric and asymmetric quantization modes, and which ops are quantized) are not ideal for the Core ML stack and Apple hardware. If you use the Core ML Tools coremltools.optimize.torch APIs, as described in this section, the correct default settings are applied automatically.

Impact on accuracy with different modes#

Weight-only post-training quantization (PTQ) using 8-bit precision with per-channel granularity typically preserves the model’s accuracy well. When compressing further to 4-bit precision, per-block granularity is required to retain accuracy.
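For reference, a 4-bit per-block weight quantization configuration might look like the following sketch; the parameter names assume a recent Core ML Tools release, and the block size is an illustrative choice:

```python
import coremltools.optimize.coreml as cto_coreml

# 4-bit weight-only quantization with per-block granularity.
op_config = cto_coreml.OpLinearQuantizerConfig(
    mode="linear_symmetric",
    dtype="int4",
    granularity="per_block",
    block_size=32,  # illustrative block size
)
config = cto_coreml.OptimizationConfig(global_config=op_config)
```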

If weight-only 8-bit per-channel PTQ does not preserve accuracy well enough, calibration-data-based techniques such as GPTQ, or fine-tuning-based methods such as QAT, can be explored.

For activation quantization, the calibration-data-based approach should work well in most cases. However, if a loss of accuracy is observed, quantizing the activations with QAT can be used to recover it.

Accuracy data#

| Model Name | Config | Optimization Workflow | Compression Ratio | Accuracy |
|---|---|---|---|---|
| MobileNetv2-1.0 | Float16 | n/a | 1.0 | 71.86 |
| MobileNetv2-1.0 | Weight-only | PTQ | 1.92 | 71.78 |
| MobileNetv2-1.0 | Weight & activation | QAT | 1.92 | 71.66 ± 0.04 |
| ResNet50 | Float16 | n/a | 1.0 | 76.14 |
| ResNet50 | Weight-only | PTQ | 1.99 | 76.10 |
| ResNet50 | Weight & activation | QAT | 1.98 | 76.80 ± 0.05 |
| MobileViTv2-1.0 | Float16 | n/a | 1.0 | 78.09 |
| MobileViTv2-1.0 | Weight-only | PTQ | 1.92 | 77.66 |
| MobileViTv2-1.0 | Weight & activation | QAT | 1.89 | 76.89 ± 0.07 |