optimize.coreml API Overview

Use coremltools.optimize.coreml (post-training compression) to compress the weights in a model. Weight compression reduces the space occupied by the model; however, the precision of the intermediate tensors and the compute precision of the ops are not altered: at load time or prediction time, weights are decompressed into float precision, and all computations use float precision.

Steps to Compress a Model

The steps to compress an ML Program (mlprogram) model are as follows:

1. Load the MLModel to be compressed.
2. Define an op-specific config (such as cto.OpPalettizerConfig) and wrap it in a cto.OptimizationConfig.
3. Call the compression method that matches the config (such as cto.palettize_weights).

The following is an example of 6-bit palettization:

import coremltools as ct
import coremltools.optimize.coreml as cto

# load model
mlmodel = ct.models.MLModel(uncompressed_model_path)

# define op config 
op_config = cto.OpPalettizerConfig(mode="kmeans", nbits=6)

# define optimization config by applying the op config globally to all ops 
config = cto.OptimizationConfig(global_config=op_config)

# palettize weights
compressed_mlmodel = cto.palettize_weights(mlmodel, config)
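
The compressed model is a regular MLModel, so it can be saved and used like any other Core ML model (the output path below is illustrative):

# save the palettized model as an mlpackage
compressed_mlmodel.save("compressed_model.mlpackage")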

Op-Specific Configurations

For sparsity, the op-specific configs can be defined using cto.OpThresholdPrunerConfig or cto.OpMagnitudePrunerConfig, and the method to prune is cto.prune_weights.
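
As a minimal sketch of the pruning flow, assuming a magnitude-based config (the target_sparsity value of 0.75 is illustrative):

import coremltools.optimize.coreml as cto

# set the 75% smallest-magnitude values in each weight tensor to zero
op_config = cto.OpMagnitudePrunerConfig(target_sparsity=0.75)
config = cto.OptimizationConfig(global_config=op_config)
pruned_mlmodel = cto.prune_weights(mlmodel, config)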

For 8-bit linear quantization, the op-specific configs are defined using cto.OpLinearQuantizerConfig, and the method to quantize is cto.linear_quantize_weights.
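
A minimal sketch of the same flow in Python (the mode and dtype values mirror the YAML example below):

import coremltools.optimize.coreml as cto

# quantize all weights to 8-bit integers using a symmetric scheme
op_config = cto.OpLinearQuantizerConfig(mode="linear_symmetric", dtype="int8")
config = cto.OptimizationConfig(global_config=op_config)
compressed_mlmodel = cto.linear_quantize_weights(mlmodel, config)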

The OptimizationConfig can also be initialized from a YAML file. Example:

# linear_config.yaml file

config_type: "OpLinearQuantizerConfig"
global_config:
	mode: "linear_symmetric"
	dtype: "int8"
import coremltools.optimize.coreml as cto

config = cto.OptimizationConfig.from_yaml("linear_config.yaml")
compressed_mlmodel = cto.linear_quantize_weights(mlmodel, config)

For details on the key parameters to set in each config, see Post-Training Pruning, Post-Training Palettization, and Post-Training Quantization.

Customizing Ops to Compress

Using the global_config flag in the OptimizationConfig class applies the same config to all the ops with weights in the model.

More granular control can be achieved with the op_type_configs and op_name_configs flags of OptimizationConfig. To get the names of the ops to customize, use the get_weights_metadata() utility (sketched below), which provides detailed information about all the weights in the network, along with the ops each weight feeds into.
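
For example, a minimal sketch of listing the weights and their consuming ops (weight_threshold filters out small weight tensors; the cutoff of 2048 elements is an assumed, illustrative value):

import coremltools.optimize.coreml as cto

# map each weight name to metadata about the weight and the ops it feeds into
weights_metadata = cto.get_weights_metadata(mlmodel, weight_threshold=2048)
for weight_name, metadata in weights_metadata.items():
    print(weight_name, metadata)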

The following example applies 6-bit palettization to all ops, except that linear ops are set to 8 bits and two conv ops (named conv1 and conv3) are excluded from palettization.

import coremltools.optimize.coreml as cto

global_config = cto.OpPalettizerConfig(nbits=6, mode="kmeans")
linear_config = cto.OpPalettizerConfig(nbits=8, mode="kmeans")
config = cto.OptimizationConfig(
    global_config=global_config,
    op_type_configs={"linear": linear_config},
    op_name_configs={"conv1": None, "conv3": None},
)
compressed_mlmodel = cto.palettize_weights(mlmodel, config)

Using such customizations, different weights in a single model can be compressed with different techniques and configurations.

Requirements for Using optimize.coreml

The optimize submodule, documented in the Post-Training Compression section of the API Reference, is available in Core ML Tools 7.0 and newer versions. Its methods are available only for the ML Program (mlprogram) model type, which is the recommended Core ML model format.
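
If your model is not already an ML Program, one can be produced at conversion time; a sketch, where source_model is a placeholder for your source model (some source formats also require inputs to be specified):

import coremltools as ct

# convert_to="mlprogram" yields the model type required by optimize.coreml
mlmodel = ct.convert(source_model, convert_to="mlprogram")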

For the APIs to compress the weights of neural networks, see Compressing Neural Network Weights.

API Compatibility

Note that coremltools 6.0 provided model compression APIs under the coremltools.compression_utils.* submodule. Those functions are now available under coremltools.optimize.coreml.*.