API Overview#
Working with Core ML Models#
Quantizing weights#
You can linearly quantize the weights of your Core ML model by using the linear_quantize_weights method as follows:
import coremltools.optimize as cto
op_config = cto.coreml.OpLinearQuantizerConfig(
    mode="linear_symmetric", weight_threshold=512
)
config = cto.coreml.OptimizationConfig(global_config=op_config)
compressed_8_bit_model = cto.coreml.linear_quantize_weights(model, config=config)
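Here, model refers to an mlprogram Core ML model that is already loaded in memory; for example, you might load one from disk along these lines (the path is illustrative):
import coremltools as ct

# Load an existing mlprogram model from disk (path is illustrative)
model = ct.models.MLModel("model.mlpackage")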
The method defaults to the linear_symmetric mode, which uses only per-channel scales and no zero-points. You can also choose the linear mode, which uses a zero-point as well and may give slightly better accuracy. For more details on the parameters available in the config, see OpLinearQuantizerConfig and OptimizationConfig in the API Reference.
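For instance, a minimal sketch of the same call with the linear mode selected instead (reusing the model and the cto import from above; compressed_model is just an illustrative name):
op_config = cto.coreml.OpLinearQuantizerConfig(
    mode="linear", weight_threshold=512
)
config = cto.coreml.OptimizationConfig(global_config=op_config)
compressed_model = cto.coreml.linear_quantize_weights(model, config=config)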
Quantizing weights and activations#
You can also quantize the activations of the model, in addition to the weights, to benefit from the int8-int8 compute available on the Neural Engine (NE), from iPhone 15 Pro onwards.
# sample_data: calibration inputs used to estimate the ranges of the activations
activation_config = cto.coreml.OptimizationConfig(
    global_config=cto.coreml.experimental.OpActivationLinearQuantizerConfig(
        mode="linear_symmetric"
    )
)
compressed_model_a8 = cto.coreml.experimental.linear_quantize_activations(
    model, activation_config, sample_data
)
After quantizing the activations to 8 bits, you can apply the linear_quantize_weights API described above to quantize the weights as well and obtain a W8A8 model.
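For example, a minimal sketch that reuses the weight-quantization config from the first example on the activation-quantized model (the name compressed_model_w8a8 is illustrative):
# Quantize the weights of the activation-quantized model to obtain a W8A8 model
compressed_model_w8a8 = cto.coreml.linear_quantize_weights(
    compressed_model_a8, config=config
)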
Working with PyTorch Models#
Quantizing weights#
Data-free quantization#
To quantize the weights in a data-free manner, use PostTrainingQuantizer, as follows:
import torch
from coremltools.optimize.torch.quantization import PostTrainingQuantizer, \
PostTrainingQuantizerConfig
config = PostTrainingQuantizerConfig.from_dict(
    {
        "global_config": {
            "weight_dtype": "int8",
            "granularity": "per_block",
            "block_size": 128,
        },
        "module_type_configs": {
            torch.nn.Linear: None
        },
    }
)
quantizer = PostTrainingQuantizer(model, config)
quantized_model = quantizer.compress()
module_type_configs lets you specify different configs for different layer types. Here, we are setting the config for linear layers to None to de-select linear layers for quantization. The granularity option lets you quantize the weights at different levels of granularity, like per_block, where blocks of weights along a channel use the same quantization parameters, or per_channel, where all elements in a channel share the same quantization parameters. Learn more about the various config options available in PostTrainingQuantizerConfig.
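For illustration, a per_channel variant of the same config could look like the following sketch (block_size is dropped since it only applies to per_block granularity):
per_channel_config = PostTrainingQuantizerConfig.from_dict(
    {
        "global_config": {
            "weight_dtype": "int8",
            "granularity": "per_channel",
        },
    }
)
quantizer = PostTrainingQuantizer(model, per_channel_config)
quantized_model = quantizer.compress()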
Calibration data based quantization#
Use LayerwiseCompressor with the GPTQ algorithm, as follows:
from coremltools.optimize.torch.quantization import LayerwiseCompressor, \
LayerwiseCompressorConfig
config = LayerwiseCompressorConfig.from_dict(
    {
        "global_config": {
            "algorithm": "gptq",
            "weight_dtype": "uint4",  # 4-bit weights
            "granularity": "per_block",
            "block_size": 128,
        },
        "input_cacher": "default",
        "calibration_nsamples": 16,
    }
)
dataloader = ...  # create a list of input tensors to be used for calibration
quantizer = LayerwiseCompressor(model, config)
compressed_model = quantizer.compress(dataloader)
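The calibration inputs are simply example inputs to the model; a minimal sketch (shapes and sample count are illustrative) might be:
import torch

# Illustrative calibration data: a list of example input tensors for the model
dataloader = [torch.randn(1, 3, 224, 224) for _ in range(16)]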
Quantizing weights and activations#
Calibration data based quantization#
LinearQuantizer, as described in the next section, is an API to do quantization aware training (QAT) for quantizing activations and weights. We can also use the same API for calibration-data-based post-training quantization to get a W8A8 model.
We use the calibration data to measure the statistics of activations and weights without actually simulating quantization during the model’s forward pass, and without needing to perform a backward pass. Since the weights are constant and do not change, this amounts to using the round-to-nearest (RTN) approach to quantize them.
import torch
from coremltools.optimize.torch.quantization import (
    LinearQuantizer,
    LinearQuantizerConfig,
    ModuleLinearQuantizerConfig,
)
config = LinearQuantizerConfig(
    global_config=ModuleLinearQuantizerConfig(
        quantization_scheme="symmetric",
        milestones=[0, 1000, 1000, 0],
    )
)
quantizer = LinearQuantizer(model, config)
quantizer.prepare(example_inputs=(torch.randn(1, 3, 224, 224),), inplace=True)
# Only step through quantizer once to enable statistics collection (milestone 0),
# and turn batch norm to inference mode (milestone 3)
quantizer.step()
# Do a forward pass through the model with calibration data
for idx, data in enumerate(dataloader):
    with torch.no_grad():
        model(data)
model.eval()
quantized_model = quantizer.finalize()
Note that here we set the first and last values of the milestones parameter to 0. The first milestone turns on observers, and setting it to zero ensures that we start measuring quantization statistics from step 0. The last milestone applies batch norm in inference mode, which means we do not use the calibration data to update the batch norm statistics; we do this because we do not want the calibration data to influence the batch norm values. The other two milestones control when fake-quantization simulation is turned on and when observers are turned off. Since we only step the quantizer once, setting them to any value larger than zero ensures those stages are never reached.
Quantization Aware Training (QAT)#
We use LinearQuantizer here as well, with a few extra steps, as demonstrated below.
Specify config in a YAML file:
global_config:
  quantization_scheme: symmetric
  milestones:
    - 0
    - 100
    - 400
    - 200
module_name_configs:
  first_layer: null
  final_layer: null
Code:
# Initialize the quantizer
config = LinearQuantizerConfig.from_yaml("/path/to/yaml/config.yaml")
quantizer = LinearQuantizer(model, config)
# Prepare the model to insert FakeQuantize layers for QAT
example_input = torch.rand(1, 1, 20, 20)
model = quantizer.prepare(example_inputs=(example_input,), inplace=True)
# Use quantizer in your PyTorch training loop
for inputs, labels in data:
    output = model(inputs)
    loss = loss_fn(output, labels)
    loss.backward()
    optimizer.step()
    quantizer.step()
# Convert operations to their quantized counterparts using parameters learnt via QAT
model = quantizer.finalize(inplace=True)
Here, we have written the configuration as a YAML file, and used module_name_configs to specify that we do not want the first and last layers to be quantized. In the actual config, you would specify the exact names of the first and last layers to deselect them for quantization. This is typically useful, but not required. A detailed explanation of the various stages of quantization can be found in the API Reference for ModuleLinearQuantizerConfig.
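To find the exact module names to use in module_name_configs, you can, for example, list your model's named modules with a generic PyTorch snippet like the one below (this is not part of the coremltools API):
# Print module names so you can identify the first and last layers to exclude
for name, module in model.named_modules():
    print(name, type(module).__name__)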
In QAT, in addition to observing the values of weights and activation tensors to compute quantization parameters, we also simulate the effects of quantization (fake quantization) during training. Instead of just performing a forward pass on the model, we perform full training with an optimizer. The forward and backward pass computations are conducted in float32 dtype; however, these float32 values follow the constraints imposed by the int8 and quint8 dtypes, for weights and activations respectively. This allows the model weights to adjust and reduce the error introduced by quantization. Straight-through estimation is used for computing gradients of the non-differentiable operations introduced by simulated quantization.
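As a rough illustration of the straight-through estimator idea (a generic PyTorch sketch, not the coremltools implementation): the forward pass applies a non-differentiable rounding, while the backward pass treats it as the identity so gradients can flow.
import torch

class RoundSTE(torch.autograd.Function):
    # Round in the forward pass; pretend rounding is the identity in the backward pass
    @staticmethod
    def forward(ctx, x):
        return torch.round(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Straight-through: pass the gradient through unchanged
        return grad_output

x = torch.randn(4, requires_grad=True)
y = RoundSTE.apply(x).sum()
y.backward()  # x.grad is all ones, as if no rounding had happened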
The LinearQuantizer algorithm is implemented as an extension of FX Graph Mode Quantization in PyTorch. It first traces the PyTorch model symbolically to obtain a torch.fx graph capturing all the operations in the model. It then analyzes this graph and inserts FakeQuantize layers. The insertion locations are chosen such that model inference on hardware is optimized, and only the weights and activations which benefit from quantization are quantized.
Since the prepare method uses prepare_qat_fx to insert quantization layers, the model returned from the method is a torch.fx.GraphModule, and as a result custom methods defined on the original model class may not be available on the returned model. Some models, like those with dynamic control flow, may not be traceable into a torch.fx.GraphModule. We recommend following the instructions in Limitations of Symbolic Tracing and the FX Graph Mode Quantization User Guide to update your model first, before using the LinearQuantizer algorithm.
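If you are unsure whether your model can be traced, a quick approximate check (a generic torch.fx sketch; note that prepare_qat_fx uses its own tracer, so this is only indicative) is:
import torch.fx

try:
    torch.fx.symbolic_trace(model)
    print("Model looks symbolically traceable")
except Exception as err:  # e.g., dynamic control flow raises a trace error
    print(f"Model is not traceable as-is: {err}")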
Converting quantized PyTorch models to Core ML#
You can convert your PyTorch model, once it has been quantized, as you would a normal PyTorch model:
import coremltools as ct

# Convert the quantized PyTorch model to Core ML format
traced_model = torch.jit.trace(model, example_input)
coreml_model = ct.convert(
    traced_model,
    convert_to="mlprogram",
    inputs=[ct.TensorType(shape=example_input.shape)],
    minimum_deployment_target=ct.target.iOS17,
)
coreml_model.save("~/quantized_model.mlpackage")
Note that you need to use minimum_deployment_target >= iOS17 when activations are also quantized.