You can linearly quantize the weights of your Core ML model by using the
linear_quantize_weights method as follows:
import coremltools.optimize.coreml as cto

# `model` is a Core ML model that was loaded or converted earlier.
op_config = cto.OpLinearQuantizerConfig(mode="linear_symmetric", weight_threshold=512)
config = cto.OptimizationConfig(global_config=op_config)
compressed_8_bit_model = cto.linear_quantize_weights(model, config=config)
Quantize Activations Plus Weights
To quantize the activations in addition to the weights, use Training-Time Quantization.
The linear_quantize_weights method iterates over the weights of the model. Weights whose size exceeds the specified weight_threshold are quantized to the 8-bit range according to the mode specified in OpLinearQuantizerConfig. The method defaults to linear_symmetric, which uses only scales and no zero-points. You can also choose the linear mode, which uses a zero-point in addition to the scale and may give slightly better accuracy.
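The difference between the two modes, and the role of the size threshold, can be illustrated with a small pure-Python sketch. This is not the coremltools implementation: the function names, the integer ranges ([-127, 127] for the symmetric case and [0, 255] for the zero-point case), and the use of flat lists as "weights" are all assumptions made for illustration.

```python
def quantize_symmetric(values):
    """Scale only, no zero-point (sketch of a linear_symmetric-style mode)."""
    max_abs = max(abs(v) for v in values) or 1.0
    scale = max_abs / 127.0
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale, 0


def quantize_affine(values):
    """Scale plus zero-point (sketch of a linear-style mode)."""
    lo = min(min(values), 0.0)  # make sure 0.0 is representable
    hi = max(max(values), 0.0)
    scale = (hi - lo) / 255.0 or 1.0
    zero_point = round(-lo / scale)
    q = [max(0, min(255, round(v / scale) + zero_point)) for v in values]
    return q, scale, zero_point


def dequantize(q, scale, zero_point):
    """Map quantized integers back to approximate float values."""
    return [(x - zero_point) * scale for x in q]


def quantize_weights(weights, mode, weight_threshold):
    """Quantize only the weights that have more than weight_threshold elements."""
    quantizer = quantize_symmetric if mode == "linear_symmetric" else quantize_affine
    result = {}
    for name, values in weights.items():
        if len(values) > weight_threshold:
            result[name] = quantizer(values)  # (integers, scale, zero_point)
        else:
            result[name] = values  # too small: left in floating point
    return result


# Hypothetical weights: one large tensor (600 elements) and one small one.
weights = {
    "large_weight": [i / 599 for i in range(600)],
    "small_bias": [0.1, -0.2],
}
compressed = quantize_weights(weights, mode="linear", weight_threshold=512)
_, scale, zero_point = compressed["large_weight"]
print("scale:", scale, "zero_point:", zero_point)
```

In this sketch, a tensor whose values are all positive wastes half of the symmetric range, while the zero-point variant spreads all 256 levels across the actual value range. That is one intuition for why a zero-point may help accuracy.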
For options on how to set different quantization configs for different weights in the same network, see Customizing Ops to Compress.
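As a rough illustration of per-op settings, the sketch below builds a config that applies one quantizer setting globally, a different one to a single op type, and skips one op by name. It assumes the `op_type_configs` and `op_name_configs` arguments of `OptimizationConfig`; the op name `"first_conv"` is hypothetical and would have to match an op name in your own model.

```python
import coremltools.optimize.coreml as cto

# Default for every op: symmetric quantization, as in the example above.
global_config = cto.OpLinearQuantizerConfig(
    mode="linear_symmetric", weight_threshold=512
)

# Override for one op type: use the zero-point ("linear") mode for `linear` ops.
linear_op_config = cto.OpLinearQuantizerConfig(mode="linear", weight_threshold=512)

config = cto.OptimizationConfig(
    global_config=global_config,
    op_type_configs={"linear": linear_op_config},
    # "first_conv" is a hypothetical op name; mapping a name to None is
    # assumed here to leave that op's weights uncompressed.
    op_name_configs={"first_conv": None},
)
```

The resulting `config` would then be passed to `cto.linear_quantize_weights(model, config=config)` in the same way as the global-only config.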
For more details on the parameters available in the config, see the entries for OpLinearQuantizerConfig and linear_quantize_weights in the API Reference.
If your model’s accuracy drops considerably after quantizing the weights, or if your model is fully resident on the Neural Engine and you want to see whether further latency gains are possible, consider quantizing both the weights and the activations using Training-Time Quantization.