Optimization Workflow#

Workflows to compress a model fall into three categories. They vary in what is required, how time- or data-intensive the process is, and how much model accuracy can be preserved for a given compression factor. In this section we go over the workflows, provide recommendations on when to use each one, and finally give an overview of which model formats and coremltools APIs to use for each approach.

Post-training data-free compression#

Characteristics of this workflow:

  • You need only the model and nothing else (no data or access to the training pipeline).

  • Algorithms in this category work by simply minimizing the error between compressed and uncompressed weights.

  • This is the fastest workflow, typically taking a few seconds, or up to a few minutes for a large model. Accuracy is highly model- and task-specific, but in this workflow it typically drops faster as the amount of compression is increased.

A few examples of when you may want to use this workflow:

  • The following two approaches can instantly give a compression factor of 2x or more with minimal loss of accuracy for most models, along with decent latency gains depending on the specific model:

    • Palettization to 8 or 6 bits.

    • Linear quantization of weights to 8 bits.

  • In many cases, you may be able to compress further without much degradation. If one of the following works for your model, it is a very quick way to make it up to 4 times smaller than the float16 precision version:

    • 4-bit palettization with grouped-channel mode (typical group sizes to try: 32, 16, 8).

    • 4-bit weight-only quantization with per-block mode (typical block sizes to try: 64, 32, 16); see the sketch after this list.
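As an illustration of the weight quantization option, below is a minimal data-free sketch. It assumes coremltools 8.0 or later, where OpLinearQuantizerConfig accepts the dtype, granularity, and block_size arguments shown, and it reuses the placeholder model path from the examples later on this page:

import coremltools as ct
import coremltools.optimize as cto

mlmodel = ct.models.MLModel(uncompressed_model_path)

# 4-bit weight-only quantization with per-block scales (block size 32)
op_config = cto.coreml.OpLinearQuantizerConfig(mode="linear_symmetric",
                                               dtype="int4",
                                               granularity="per_block",
                                               block_size=32)
config = cto.coreml.OptimizationConfig(global_config=op_config)
compressed_mlmodel = cto.coreml.linear_quantize_weights(mlmodel, config)

Note that sub-8-bit and per-block modes typically require the model to target a newer opset (iOS18 / macOS15) than 8-bit per-channel quantization does.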

Since on-device performance metrics (latency, memory footprint, model size) depend only on the compression configuration, and not on the workflow used to reach it, it is always recommended to start with the data-free approach to get a quick estimate of latency and runtime performance. For instance, if you find that 70% sparsity gets you to your latency goal on your target device, you can then look into calibration-data-based or fine-tuning workflows to get a model with that configuration and good accuracy.
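For example, to check what 70% sparsity does to latency and model size before investing in calibration or fine-tuning, you could apply data-free magnitude pruning directly to the mlpackage. A minimal sketch, again using a placeholder model path:

import coremltools as ct
import coremltools.optimize as cto

mlmodel = ct.models.MLModel(uncompressed_model_path)

# prune each weight tensor to 70% sparsity, purely to measure
# latency / size on the target device; accuracy is not the goal here
op_config = cto.coreml.OpMagnitudePrunerConfig(target_sparsity=0.7)
config = cto.coreml.OptimizationConfig(global_config=op_config)
pruned_mlmodel = cto.coreml.prune_weights(mlmodel, config)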

Post-training calibration-data-based compression#

  • A small amount of data is required (e.g., 128 training samples).

  • With data available, the algorithms in this category can compress weights while accounting for the quantization error in the predicted outputs (final or intermediate).

  • Algorithms in this class may or may not be gradient based; depending on that, you may need to provide a loss function in addition to the data.

Typical examples of when this workflow may be appropriate:

  • Quantizing activations to 8 bits for latency gains on the Neural Engine (NE). This requires data, which is needed to compute the correct scales/offsets for the intermediate activations.

  • Palettization to 4 bits on large models. In many cases, accuracy can be improved over the data-free method by using a data-aware version of the k-means algorithm (available via the cto.torch.palettization.SKMPalettizer API).

  • Similarly, both 4-bit weight quantization and pruning may do better with the calibration-data-based optimizations available via the cto.torch.layerwise_compression API; a sketch is shown after this list.
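A rough sketch of the latter, assuming the LayerwiseCompressor and LayerwiseCompressorConfig classes from coremltools 8 with the GPTQ algorithm, a calibration dataloader, and a device to run the compression on (check the API reference for the exact argument names supported by your version):

from coremltools.optimize.torch.layerwise_compression import LayerwiseCompressorConfig, \
                                                             LayerwiseCompressor

config = LayerwiseCompressorConfig.from_dict(
    {
        "global_config": {
            "algorithm": "gptq",
            "weight_dtype": "uint4",
        },
        "input_cacher": "default",
        "calibration_nsamples": 128,
    }
)
compressor = LayerwiseCompressor(uncompressed_torch_model, config)
compressed_torch_model = compressor.compress(dataloader=..., device="cpu")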

Model fine-tuning based compression#

  • Performs the best in terms of preserving accuracy at higher amounts of compression (e.g., 4 bits or lower). Accordingly, it is also the most time- and data-intensive of all the workflows.

  • Even though you will typically start from a pre-trained model, fine-tuning requires access to the full training pipeline along with the training data.

  • A few examples of when this approach is appropriate:

    • Palettization to 4 bits with a single LUT, or to fewer than 4 bits.

    • If activation quantization with calibration data loses accuracy, then quantization-aware training (QAT) is required to recover it.

    • For pruning, this is the most effective workflow, and it is often required to achieve higher levels of sparsity (75% or more) without significant loss of accuracy.

For large models, if the data-free or calibration-data-based techniques lead to high degradation, it is recommended to first compress the weights of the torch model (say, using the data-free approach) and then try to regain accuracy by performing parameter-efficient fine-tuning, i.e., attaching adapters to the model and fine-tuning only those. A minimal sketch of such an adapter appears below.
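The sketch below wraps the top-level linear layers of an already compressed torch model with small trainable low-rank adapters in plain PyTorch (the class name, rank, and the compressed_torch_model variable are illustrative placeholders, not coremltools APIs); the compressed base weights stay frozen and only the adapter parameters are trained:

import torch
import torch.nn as nn

class LinearWithAdapter(nn.Module):
    """Adds a trainable low-rank (LoRA-style) adapter on top of a frozen linear layer."""
    def __init__(self, base_linear: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base_linear
        for p in self.base.parameters():
            p.requires_grad = False          # keep the compressed weights frozen
        self.lora_a = nn.Parameter(torch.randn(rank, base_linear.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base_linear.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.lora_a.T @ self.lora_b.T)

# replace the top-level linear layers of the already compressed model with
# adapted versions (a real model may need a recursive walk over submodules)
for name, module in list(compressed_torch_model.named_children()):
    if isinstance(module, nn.Linear):
        setattr(compressed_torch_model, name, LinearWithAdapter(module))

# only the adapter parameters receive gradients during fine-tuning
trainable_params = [p for p in compressed_torch_model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable_params, lr=1e-4)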

[Figure: A hypothetical “accuracy vs. amount of compression” trade-off curve, illustrating what you may see on average for the different compression workflows.]

[Figure: Compression workflows for different input model formats.]

APIs for each workflow#

To find a list of APIs, take a look at the What's New page. Below we provide a brief overview of a few of them. Check out the API description page in each of the palettization, quantization, and pruning sections for more comprehensive details.

Data-free compression#

In this case, since all that is needed is the model, you may find it convenient to use the APIs that take in an mlpackage and return a compressed mlpackage. Methods available under coremltools.optimize.coreml do that for you. PyTorch models are also supported in this flow, as in some cases you may find it more convenient to work with those, for instance, when experimenting with multiple rounds of compression, say, applying sparsity with data calibration followed by data-free palettization.

Sample pseudocode of applying palettization to an mlpackage model:

import coremltools as ct
import coremltools.optimize as cto

mlmodel = ct.models.MLModel(uncompressed_model_path)
op_config = cto.coreml.OpPalettizerConfig(mode="kmeans",
                                          nbits=4,
                                          granularity="per_grouped_channel",
                                          group_size=16)
model_config = cto.coreml.OptimizationConfig(global_config=op_config)
compressed_mlmodel = cto.coreml.palettize_weights(mlmodel, model_config)

Sample pseudocode of applying palettization to a torch model:

import coremltools as ct
import torch
from coremltools.optimize.torch.palettization import PostTrainingPalettizerConfig, \
                                                     PostTrainingPalettizer

config = PostTrainingPalettizerConfig.from_dict(
    {
        "global_config": {
            "n_bits": 4,
            "granularity": "per_grouped_channel",
            "group_size": 16,
        }
    }
)
palettizer = PostTrainingPalettizer(uncompressed_torch_model, config)
palettized_model = palettizer.compress()

traced_palettized_model = torch.jit.trace(palettized_model, example_input)
compressed_mlmodel = ct.convert(traced_palettized_model, inputs=...,
                                minimum_deployment_target=ct.target.iOS18)

With a calibration dataset#

This flow is mainly available via the coremltools.optimize.torch APIs, as it may require access to the loss function and gradient computation.

Sample pseudocode of applying palettization using the sensitive k-means algorithm on a torch model:

from coremltools.optimize.torch.palettization import SKMPalettizerConfig,\
                                                     SKMPalettizer 

config = SKMPalettizerConfig.from_dict(
    {
        "global_config": {
            "n_bits": 4,
            "granularity": "per_grouped_channel",
            "group_size": 16,
        }
    }
)
palettizer = SKMPalettizer(uncompressed_torch_model, config)
compressed_torch_model = palettizer.compress(data_loader=..., loss_function=...)

Activation quantization can be applied either to the torch model or directly to an mlpackage model. Sample pseudocode for the latter:

import coremltools as ct 
import coremltools.optimize as cto
# The following API is for coremltools==8.0b1
# It will be moved out of "experimental" in later versions of coremltools 
from coremltools.optimize.coreml.experimental import OpActivationLinearQuantizerConfig, \
                                                     linear_quantize_activations

mlmodel = ct.models.MLModel(uncompressed_model_path)

# quantize activations to 8 bits (this will give an A8W16 model)
act_quant_op_config = OpActivationLinearQuantizerConfig(mode="linear_symmetric")
act_quant_model_config = cto.coreml.OptimizationConfig(global_config=act_quant_op_config)
mlmodel_compressed_activations = linear_quantize_activations(mlmodel, 
                                                             act_quant_model_config,
                                                             sample_data=...)

# quantize weights to 8 bits (this will give an A8W8 model)
weight_quant_op_config = cto.coreml.OpLinearQuantizerConfig(mode="linear_symmetric",
                                                            dtype="int8")
weight_quant_model_config = cto.coreml.OptimizationConfig(global_config=weight_quant_op_config)
mlmodel_compressed = cto.coreml.linear_quantize_weights(mlmodel_compressed_activations,
                                                        weight_quant_model_config)

With fine-tuning#

This workflow is available only for torch models, via the coremltools.optimize.torch APIs, as it involves integration with the torch training code. This integration can be done easily by modifying the original training code with a few lines: mainly invocations of the “prepare”, “step”, and “finalize” methods. See the examples of fine-tuning with palettization, quantization, and pruning on an MNIST model for an overview of the APIs. A rough sketch of how these calls fit into a training loop is shown below.
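The following is only a sketch of that pattern, using differentiable k-means (DKM) palettization as an example; the model, data loader, loss function, and number of epochs are placeholders taken from your existing torch training pipeline:

import torch
from coremltools.optimize.torch.palettization import DKMPalettizerConfig, \
                                                     DKMPalettizer

config = DKMPalettizerConfig.from_dict({"global_config": {"n_bits": 4}})
palettizer = DKMPalettizer(model, config)

prepared_model = palettizer.prepare()            # insert palettization-aware modules
optimizer = torch.optim.Adam(prepared_model.parameters(), lr=1e-3)

for epoch in range(num_epochs):
    for inputs, targets in train_loader:
        optimizer.zero_grad()
        loss = loss_function(prepared_model(inputs), targets)
        loss.backward()
        optimizer.step()
        palettizer.step()                        # advance the palettizer's state

finetuned_palettized_model = palettizer.finalize()   # fold the learned LUTs into the model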