Pruning

coremltools.optimize.coreml.prune_weights(*args, **kwargs)[source]

Utility function to convert a float-precision MLModel of type mlprogram to a compressed MLModel using a sparse representation. The const ops storing weight values are replaced by constexpr_sparse_to_dense ops.

This function is useful if the model was trained with pruning techniques, so that many of the weight values are zero. If a large percentage of the weight values are zero, a sparse representation is more efficient than a dense one (the default).

The sparsified weights are stored using a bit mask. If the weight values are {0, 0, 0, 0, 0, 0, 0, 56.3}, their sparse representation contains a bit mask with ones at the locations where the value is non-zero: 00000001b. This is accompanied by the non-zero data, which here is a size-1 vector of value {56.3}.

For example, given the following:

  • weight = [0.3, 0, 0, 0.5, 0, 0]

  • non_zero_data, bit_mask = sparsify(weight)

The resulting sparse representation is:

  • non_zero_data = [0.3, 0.5]

  • bit_mask = "100100"
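For illustration, this decomposition can be sketched in NumPy as follows. The sparsify helper here is hypothetical and exists only to make the representation concrete; coremltools performs this internally when it emits constexpr_sparse_to_dense ops.

import numpy as np

def sparsify(weight):
    # Hypothetical helper for illustration only.
    weight = np.asarray(weight)
    bit_mask = (weight != 0).astype(np.uint8)  # 1 where the value is non-zero
    non_zero_data = weight[weight != 0]        # densely packed non-zero values
    return non_zero_data, bit_mask

non_zero_data, bit_mask = sparsify([0.3, 0, 0, 0.5, 0, 0])
print(non_zero_data)                # [0.3 0.5]
print("".join(map(str, bit_mask)))  # 100100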

Parameters:
mlmodel: MLModel

Model to be sparsified. This MLModel should be of type mlprogram.

config: OptimizationConfig

An OptimizationConfig object that specifies the parameters for weight pruning.

joint_compression: bool

When set, the input mlmodel (which should already be compressed) is further pruned into a jointly compressed mlmodel. For details on which compression schemas can be further jointly pruned, see the prune_weights graph pass.

Take “quantize + prune” as an example: the input mlmodel is already quantized, and it is further pruned. The weight values are then represented by constexpr_sparse_blockwise_shift_scale + constexpr_sparse_to_dense ops: quantized(sparse) -> constexpr_sparse_blockwise_shift_scale -> weight(sparse) -> constexpr_sparse_to_dense -> weight(dense)
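A sketch of that “quantize + prune” flow, assuming a coremltools version in which both linear_quantize_weights and prune_weights accept joint_compression (the specific configs are illustrative choices):

import coremltools as ct
import coremltools.optimize as cto

model = ct.models.MLModel("my_model.mlpackage")

# Step 1: quantize the dense weights.
quant_config = cto.coreml.OptimizationConfig(
    global_config=cto.coreml.OpLinearQuantizerConfig(mode="linear_symmetric")
)
quantized_model = cto.coreml.linear_quantize_weights(model, quant_config)

# Step 2: further prune the already-quantized model.
prune_config = cto.coreml.OptimizationConfig(
    global_config=cto.coreml.OpMagnitudePrunerConfig(target_sparsity=0.5)
)
jointly_compressed_model = cto.coreml.prune_weights(
    quantized_model, prune_config, joint_compression=True
)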

Returns:
model: MLModel

The sparse MLModel instance.

Examples

import coremltools as ct
import coremltools.optimize as cto

model = ct.models.MLModel("my_model.mlpackage")
config = cto.coreml.OptimizationConfig(
    global_config=cto.coreml.OpThresholdPrunerConfig(threshold=1e-12)
)
compressed_model = cto.coreml.prune_weights(model, config)
class coremltools.optimize.coreml.OpThresholdPrunerConfig(threshold: float = 1e-12, minimum_sparsity_percentile: float = 0.5, weight_threshold: int | None = 2048)[source]

All weights with absolute value smaller than threshold are changed to 0, and the tensor is stored in a sparse format.

For example, given the following:

  • weight = [0.3, -0.2, -0.01, 0.05]

  • threshold = 0.03

The sparsified weight would be [0.3, -0.2, 0, 0.05].
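The thresholding rule is equivalent to the following NumPy sketch (illustration only):

import numpy as np

weight = np.array([0.3, -0.2, -0.01, 0.05])
threshold = 0.03

# Zero out every value whose magnitude falls below the threshold.
sparsified = np.where(np.abs(weight) < threshold, 0.0, weight)
print(sparsified)  # [ 0.3  -0.2   0.    0.05]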

Parameters:
threshold: float

All weight values whose absolute value is below this threshold are set to 0.

  • Default value is 1e-12.

minimum_sparsity_percentile: float

The sparsity level must be above this value for the weight representation to be stored in the sparse format rather than the dense format.

For example, if minimum_sparsity_percentile = 0.6 and the sparsity level is 0.54 (that is, 54% of the weight values are exactly 0), the resulting weight tensor is stored as a dense const op and is not converted to the constexpr_sparse_to_dense op (which stores the weight values in a sparse format).

  • Must be a value between 0 and 1.

  • Default value is 0.5.

weight_threshold: int

The size threshold above which weights are pruned. That is, a weight tensor is pruned only if its total number of elements is greater than weight_threshold.

For example, if weight_threshold = 1024 and a weight tensor has shape [10, 20, 1, 1] (200 elements), it will not be pruned.

  • If not provided, it defaults to 2048, in which case only weights with more than 2048 elements are compressed.
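Putting the three parameters together, a config that zeroes near-zero values while leaving small or insufficiently sparse tensors dense might look like the following (the values are illustrative):

import coremltools as ct
import coremltools.optimize as cto

model = ct.models.MLModel("my_model.mlpackage")
config = cto.coreml.OptimizationConfig(
    global_config=cto.coreml.OpThresholdPrunerConfig(
        threshold=1e-3,                   # zero out values with |w| < 1e-3
        minimum_sparsity_percentile=0.6,  # stay dense unless >= 60% zeros
        weight_threshold=2048,            # skip tensors with <= 2048 elements
    )
)
sparse_model = cto.coreml.prune_weights(model, config)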

class coremltools.optimize.coreml.OpMagnitudePrunerConfig(target_sparsity: float | None = None, block_size: int | None = None, n_m_ratio: Tuple[int, int] | None = None, dim: int | None = None, weight_threshold: int | None = 2048)[source]

Prune the weights to a constant sparsity level, which can be specified by either target_sparsity or n_m_ratio.

If target_sparsity is set, the n weight values with the lowest absolute values are set to 0, where n = floor(size_of_weight_tensor * target_sparsity). For example, given the following:

  • weight = [0.3, -0.2, -0.01, 0.05]

  • target_sparsity = 0.75

The sparsified weight would be [0.3, 0, 0, 0].
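In NumPy terms, the target_sparsity rule amounts to this sketch (illustration only):

import numpy as np

weight = np.array([0.3, -0.2, -0.01, 0.05])
target_sparsity = 0.75

n = int(np.floor(weight.size * target_sparsity))  # n = 3
prune_idx = np.argsort(np.abs(weight))[:n]        # n smallest magnitudes
pruned = weight.copy()
pruned[prune_idx] = 0.0
print(pruned)  # [0.3 0.  0.  0. ]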

If block_size is set, then weights are pruned in a block-structured manner: contiguous chunks of block_size weight values are set to 0 together. Block sparsity can only be applied to linear and conv layers. For example:

# Given a 4 x 2 weight with the following value, and block_size = 2, dim = 0.
[
    [1, 3],
    [-6, -7],
    [0, 3],
    [-9, 2],
]

# We first flatten the matrix along axis = 0.
[1, -6, 0, -9, 3, -7, 3, 2]

# For block size 2, the L2 norm is computed over the first 2 elements, then the next 2 elements, and so on.
[6.08, 9.00, 7.62, 3.61]

# The blocks with the smallest L2 norms are pruned. So if target_sparsity = 0.5, the two blocks with
# L2 norms 6.08 and 3.61 are pruned; that is, the elements in the first and fourth blocks are set to 0.
# The resulting flattened pruned tensor is:
[0, 0, 0, -9, 3, -7, 0, 0]

# The final pruned tensor is:
[
    [0, 3],
    [0, -7],
    [0, 0],
    [-9, 0],
]
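The same walkthrough as runnable NumPy code (a sketch, assuming dim = 0 so that blocks run along the flattened output-channel axis):

import numpy as np

weight = np.array([[1, 3], [-6, -7], [0, 3], [-9, 2]], dtype=np.float64)
block_size, target_sparsity = 2, 0.5

flat = weight.flatten(order="F")        # flatten along axis 0: [1 -6 0 -9 3 -7 3 2]
blocks = flat.reshape(-1, block_size)   # non-overlapping blocks of 2
norms = np.linalg.norm(blocks, axis=1)  # [6.08 9.00 7.62 3.61]

# Zero out the blocks with the smallest L2 norms.
n_prune = int(np.floor(len(blocks) * target_sparsity))
blocks[np.argsort(norms)[:n_prune]] = 0
pruned = flat.reshape(weight.shape, order="F")
print(pruned)  # [[ 0.  3.] [ 0. -7.] [ 0.  0.] [-9.  0.]]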

The n_m_ratio triggers n:m pruning along the dim axis. In n:m pruning, out of every m consecutive elements, the n with the lowest magnitudes are set to 0. For more information, see Learning N:M Fine-Grained Structured Sparse Neural Networks From Scratch.

n:m pruning can be applied only to linear and conv layers.

Example

# Given a 4 x 4 weight of
[
    [3, 4, 7, 6],
    [1, 8, -3, -8],
    [-2, -3, -4, 0],
    [5, 4, -3, -2],
]

# For n_m_ratio = (1, 2) with dim = 1 (the default), the resulting pruned weight is
[
    [0, 4, 7, 0],
    [0, 8, 0, -8],
    [0, -3, -4, 0],
    [5, 0, -3, 0],
]

# For dim = 0, we get
[
    [3, 0, 7, 0],
    [0, 8, 0, -8],
    [0, 0, -4, 0],
    [5, 4, 0, -2],
]
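The 1:2 example can be reproduced with the following NumPy sketch (illustration only; nm_prune is a hypothetical helper, and the axis handling follows the example above, not the library internals):

import numpy as np

def nm_prune(w, n, m, axis):
    # Out of every m consecutive elements along `axis`, zero the n values
    # with the smallest magnitudes. Assumes the axis length is divisible by m.
    w = np.swapaxes(w, axis, -1).copy()       # move the pruning axis last
    groups = w.reshape(*w.shape[:-1], -1, m)  # split into m-element groups
    idx = np.argsort(np.abs(groups), axis=-1)[..., :n]
    np.put_along_axis(groups, idx, 0.0, axis=-1)
    return np.swapaxes(groups.reshape(w.shape), -1, axis)

weight = np.array(
    [[3, 4, 7, 6], [1, 8, -3, -8], [-2, -3, -4, 0], [5, 4, -3, -2]],
    dtype=np.float64,
)
print(nm_prune(weight, n=1, m=2, axis=1))  # matches the dim = 1 result above
print(nm_prune(weight, n=1, m=2, axis=0))  # matches the dim = 0 result above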
Parameters:
target_sparsity: float

The level of sparsity for compression, which must be in the range [0, 1]. When 0, no sparsification occurs; when 1, all weights become 0.

block_size: int

Block size for inducing block sparsity. This is applied on the dim dimension of the parameter. Having the zeros aligned in the parameter helps gain latency/memory performance on-device.

  • If set, must be greater than 1 to enable block sparsity.

  • Block sparsity can be applied only to linear and conv layers.

  • The channel will be padded with 0 if it is not divisible by block_size.

n_m_ratio: tuple[int, int]

A tuple of two integers that specifies the ratio for n:m pruning.

  • n must be smaller than or equal to m.

  • The channel will be padded with 0 if it is not divisible by m.

dim: int

Dimension where the block sparsity or n:m sparsity is applied.

  • Must be either 0 or 1.

  • The default value for block sparsity is 0 (output channel).

  • The default value for n:m sparsity is 1 (input channel).

weight_threshold: int

The size threshold above which weights are pruned. That is, a weight tensor is pruned only if its total number of elements is greater than weight_threshold.

For example, if weight_threshold = 1024 and a weight tensor has shape [10, 20, 1, 1] (200 elements), it will not be pruned.

  • If not provided, it defaults to 2048, in which case only weights with more than 2048 elements are compressed.
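For completeness, an end-to-end sketch mirroring the prune_weights example above, this time with a magnitude pruner (the parameter values are illustrative):

import coremltools as ct
import coremltools.optimize as cto

model = ct.models.MLModel("my_model.mlpackage")
config = cto.coreml.OptimizationConfig(
    global_config=cto.coreml.OpMagnitudePrunerConfig(
        target_sparsity=0.75,   # zero out 75% of each weight tensor
        weight_threshold=2048,  # skip tensors with <= 2048 elements
    )
)
sparse_model = cto.coreml.prune_weights(model, config)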