Quantization Algorithms#
Following are the various algorithms available in Core ML Tools to quantize a model:
- Post Training (data free) weight quantization
- Post Training (data calibration) activation quantization
- GPTQ algorithm for weight quantization (post training data calibration)
- Fine tuning based algorithm for quantizing weights and/or activations
Post Training (data free) weight quantization#
This algorithm uses the round-to-nearest (RTN) method to quantize the model weights. It is the fastest approach for quantizing model weights.
Suggested API(s):
- coremltools.optimize.torch.quantization.PostTrainingQuantizer (for Torch models)
- coremltools.optimize.coreml.linear_quantize_weights (for Core ML models)
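As a rough illustration of what the RTN method computes (this is a standalone sketch with illustrative names, not the Core ML Tools implementation), here is a minimal NumPy example of symmetric 8-bit round-to-nearest weight quantization:

```python
import numpy as np

def rtn_quantize_symmetric(weights, n_bits=8):
    """Round-to-nearest symmetric quantization: the scale maps the largest
    absolute weight onto the edge of the signed integer range."""
    qmax = 2 ** (n_bits - 1) - 1                    # e.g. 127 for 8 bits
    scale = np.abs(weights).max() / qmax
    q = np.clip(np.round(weights / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

# Dequantizing recovers an approximation of the original weights.
w = np.array([0.5, -1.27, 0.02, 1.0], dtype=np.float32)
q, scale = rtn_quantize_symmetric(w)
w_hat = q.astype(np.float32) * scale                # dequantized weights
```

Because RTN only needs the weights themselves, no calibration data is required, which is why this is the fastest option.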
Post Training (data calibration) activation quantization#
This algorithm quantizes the activations using a calibration dataset. The data is passed through the model, and the range of values that the activations take is estimated. This estimate is then used to compute the scale / zero point using the RTN method for quantizing the activations. Suggested API(s):
- coremltools.optimize.torch.quantization.LinearQuantizer (for Torch models)
- coremltools.optimize.coreml.experimental.linear_quantize_activations (for Core ML models)
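The calibration step described above can be sketched as follows. This is a simplified stand-in with illustrative helper names, not the coremltools implementation: it tracks the observed activation range over calibration batches, then derives an asymmetric (affine) scale and zero point with the RTN rule:

```python
import numpy as np

def calibrate_activation_range(model_fn, calibration_batches):
    """Pass calibration data through the model and track the observed
    min/max of the activation of interest."""
    lo, hi = np.inf, -np.inf
    for batch in calibration_batches:
        act = model_fn(batch)
        lo = min(lo, float(act.min()))
        hi = max(hi, float(act.max()))
    return lo, hi

def affine_quant_params(lo, hi, n_bits=8):
    """Asymmetric (affine) scale / zero point from an observed range."""
    qmin, qmax = 0, 2 ** n_bits - 1
    scale = (hi - lo) / (qmax - qmin)
    zero_point = int(round(qmin - lo / scale))
    return scale, zero_point

# Toy "model": a ReLU activation observed over two calibration batches.
relu = lambda x: np.maximum(x, 0.0)
batches = [np.array([-1.0, 2.0]), np.array([0.5, 5.1])]
lo, hi = calibrate_activation_range(relu, batches)
scale, zero_point = affine_quant_params(lo, hi)
```

In practice a handful of representative batches is enough for the range estimate to stabilize.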
GPTQ algorithm for weight quantization (post training data calibration)#
This algorithm is based on the paper GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers. The layer-wise compression paradigm compresses a sequential model layer by layer, minimizing the quantization error as the weights are quantized. Each layer is compressed by minimizing the L2 norm of the difference between the layer's original outputs and the outputs obtained by using the compressed weights. The outputs are computed using a few samples of training data (around 128 samples are usually sufficient). Once a layer is compressed, the layer's outputs are used as inputs for compressing the next layer.
Suggested API(s):
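To make the per-layer objective concrete, here is a toy NumPy sketch of the quantity GPTQ minimizes. This only demonstrates the error metric for one linear layer (all names are illustrative); it is not the GPTQ solver or a Core ML Tools API:

```python
import numpy as np

rng = np.random.default_rng(0)

# A toy linear layer and ~128 calibration samples, as in the description above.
W = rng.normal(size=(16, 8))          # original float weights
X = rng.normal(size=(128, 16))        # layer inputs collected from calibration data

def rtn(w, n_bits=4):
    """Plain round-to-nearest baseline, for comparison."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = np.abs(w).max() / qmax
    return np.round(w / scale) * scale

def layer_error(W_hat):
    """The per-layer objective: L2 norm of the difference between the
    layer's original outputs and its outputs with compressed weights."""
    return np.linalg.norm(X @ W - X @ W_hat)

W_q = rtn(W)                           # candidate compressed weights
err = layer_error(W_q)                 # error GPTQ would drive down further
```

GPTQ improves on this RTN baseline by adjusting the remaining unquantized weights after each quantization step so that the output error above stays small.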
Fine tuning based algorithm for quantizing weight and/or activations#
This algorithm is also known as quantization-aware training (QAT), as described in the paper Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference. QAT allows for quantizing both weights and activations. The model is fine-tuned while simulating quantization on the weights and/or activations, to recover the accuracy lost upon quantizing the model.
Suggested API(s):
PyTorch quantization APIs
You can use PyTorch's quantization APIs directly, and then convert the model to Core ML. However, the converted model's performance may not be optimal: the PyTorch APIs' default settings (symmetric vs. asymmetric quantization modes, and which ops are quantized) are not tuned for the Core ML stack and Apple hardware. If you use the Core ML Tools coremltools.optimize.torch APIs, as described in this section, the correct default settings are applied automatically.
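The quantization simulation used during fine-tuning is often called fake quantization: weights are quantized and immediately dequantized in the forward pass, so the layer still runs in float but experiences quantization error. The following NumPy sketch is illustrative only, not the coremltools or PyTorch implementation:

```python
import numpy as np

def fake_quantize(w, n_bits=8):
    """Simulate quantization in the forward pass (QAT-style): quantize,
    then dequantize, so the model sees the quantization error while
    remaining in float. During training, the non-differentiable round()
    is bypassed in the backward pass (the straight-through estimator),
    so gradients still flow to the underlying float weights."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = np.abs(w).max() / qmax
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale

w = np.array([0.31, -0.7, 0.05])       # toy float weights
w_fq = fake_quantize(w)                # what the layer actually uses in QAT
```

Fine-tuning against these perturbed weights lets the model adapt to the error that real quantization will later introduce.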
Impact on accuracy with different modes#
Weight-only post-training quantization (PTQ) using 8-bit precision with per-channel granularity typically preserves the accuracy of the model well. If the model is further compressed to 4-bit precision, per-block granularity is required to retain accuracy.
If the former method (weight-only PTQ, 8-bit, per-channel) does not work well, then calibration-data-based techniques such as GPTQ, or fine-tuning-based methods such as QAT, can be explored.
For activation quantization, the calibration-data-based approach should work well in most cases. However, if a loss of accuracy is observed, quantizing activations with QAT can be used to recover it.
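To make the granularity terms concrete, the following sketch (with illustrative names, not a Core ML Tools API) computes one scale per output channel versus one scale per contiguous block of weights within a channel:

```python
import numpy as np

def per_channel_scales(W, n_bits=4):
    """One scale per output channel (row): coarse, cheap to store."""
    qmax = 2 ** (n_bits - 1) - 1
    return np.abs(W).max(axis=1) / qmax              # shape: (out_channels,)

def per_block_scales(W, block_size, n_bits=4):
    """One scale per block of `block_size` weights within each channel:
    finer granularity, which is what lets 4-bit quantization keep accuracy."""
    qmax = 2 ** (n_bits - 1) - 1
    out_ch, in_ch = W.shape
    blocks = W.reshape(out_ch, in_ch // block_size, block_size)
    return np.abs(blocks).max(axis=2) / qmax         # shape: (out_channels, n_blocks)

W = np.arange(32, dtype=np.float32).reshape(4, 8)    # toy 4x8 weight matrix
s_ch = per_channel_scales(W)
s_blk = per_block_scales(W, block_size=4)
```

Smaller blocks mean each scale covers fewer weights, so outliers distort fewer neighbors, at the cost of storing more scales.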
Accuracy data#
| Model Name | Config | Optimization Workflow | Compression Ratio | Accuracy |
| --- | --- | --- | --- | --- |
| | Float16 | n/a | 1.0 | 71.86 |
| | Weight-only | PTQ | 1.92 | 71.78 |
| | Weight-only | QAT | 1.92 | 71.66 ± 0.04 |
| | Float16 | n/a | 1.0 | 76.14 |
| | Weight-only | PTQ | 1.99 | 76.10 |
| | Weight-only | QAT | 1.98 | 76.80 ± 0.05 |
| | Float16 | n/a | 1.0 | 78.09 |
| | Weight-only | PTQ | 1.92 | 77.66 |
| | Weight-only | QAT | 1.89 | 76.89 ± 0.07 |