Pruning
- coremltools.optimize.coreml.prune_weights(*args, **kwargs)
Utility function to convert a float-precision MLModel of type `mlprogram` to a compressed MLModel using a sparse representation. The `const` ops storing weight values are replaced by `constexpr_sparse_to_dense` ops. This function is useful if the model is trained with pruning techniques, so that many of the weights are zero. If a large percentage of weight values are zero, a sparse representation is more efficient than a dense one (the default).
The sparsified weights are stored in a bit mask. If the weight values are `{0, 0, 0, 0, 0, 0, 0, 56.3}`, the sparse representation contains a bit mask with ones at the locations where the value is non-zero: `00000001b`. This is accompanied by the non-zero data, which is a size-1 vector of value `{56.3}`.
For example, given the following:

```python
weight = [0.3, 0, 0, 0.5, 0, 0]
non_zero_data, bit_mask = sparsify(weight)
```

The resulting non-zero data and bit mask are:

```python
non_zero_data = [0.3, 0.5]
bit_mask = "100100"
```
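As a minimal sketch of how such a `sparsify` step could work (plain NumPy; the helper below is purely illustrative and is not part of the coremltools API):

```python
import numpy as np


def sparsify(weight):
    # Illustrative only: build the bit mask / non-zero data pair
    # described above, not the actual coremltools internals.
    weight = np.asarray(weight)
    mask = weight != 0                    # one bit per element
    non_zero_data = weight[mask]          # only the non-zero values survive
    bit_mask = "".join("1" if m else "0" for m in mask)
    return non_zero_data, bit_mask


non_zero_data, bit_mask = sparsify([0.3, 0, 0, 0.5, 0, 0])
# non_zero_data -> array([0.3, 0.5]), bit_mask -> "100100"
```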
- Parameters:
  - mlmodel: MLModel
    Model to be sparsified. This MLModel should be of type `mlprogram`.
  - config: OptimizationConfig
    An `OptimizationConfig` object that specifies the parameters for weight pruning.
  - joint_compression: bool
    Whether to further prune an already-compressed input MLModel into a jointly compressed MLModel. See the `prune_weights` graph pass for information about which compression schemas can be further pruned.
    Take "quantize + prune" as an example of joint compression: the input MLModel is already quantized and will be further pruned. In this case, the weight values are represented by `constexpr_sparse_blockwise_shift_scale` + `constexpr_sparse_to_dense` ops: `quantized(sparse) -> constexpr_sparse_blockwise_shift_scale -> weight(sparse) -> constexpr_sparse_to_dense -> weight(dense)`. A sketch of this flow appears after the example below.
- Returns:
  - model: MLModel
    The sparse MLModel instance.
Examples
```python
import coremltools as ct
import coremltools.optimize as cto

model = ct.models.MLModel("my_model.mlpackage")
config = cto.coreml.OptimizationConfig(
    global_config=cto.coreml.OpThresholdPrunerConfig(threshold=1e-12)
)
compressed_model = cto.coreml.prune_weights(model, config)
```
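For the joint-compression path mentioned in the `joint_compression` parameter above, the "quantize + prune" flow could look roughly like the following sketch. It assumes `linear_quantize_weights` and `OpLinearQuantizerConfig` from the same `coremltools.optimize.coreml` module; treat it as an illustration rather than a verified recipe:

```python
import coremltools as ct
import coremltools.optimize as cto

model = ct.models.MLModel("my_model.mlpackage")

# Step 1: quantize the weights (assumed API from coremltools.optimize.coreml).
quant_config = cto.coreml.OptimizationConfig(
    global_config=cto.coreml.OpLinearQuantizerConfig(mode="linear_symmetric")
)
quantized_model = cto.coreml.linear_quantize_weights(model, quant_config)

# Step 2: further prune the already-quantized model.
prune_config = cto.coreml.OptimizationConfig(
    global_config=cto.coreml.OpMagnitudePrunerConfig(target_sparsity=0.5)
)
jointly_compressed = cto.coreml.prune_weights(
    quantized_model, prune_config, joint_compression=True
)
```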
- class coremltools.optimize.coreml.OpThresholdPrunerConfig(threshold: float = 1e-12, minimum_sparsity_percentile: float = 0.5, weight_threshold: int | None = 2048)
All weights with an absolute value smaller than `threshold` are changed to `0`, and the tensor is stored in a sparse format. For example, given the following:

```python
weight = [0.3, -0.2, -0.01, 0.05]
threshold = 0.03
```

The sparsified weight would be `[0.3, -0.2, 0, 0.05]`.
- Parameters:
  - threshold: float
    All weight values with a magnitude below this threshold are set to `0`.
    Default value is `1e-12`.
  - minimum_sparsity_percentile: float
    The sparsity level must be above this value for the weight representation to be stored in the sparse format rather than the dense format.
    For example, if `minimum_sparsity_percentile = 0.6` and the sparsity level is `0.54` (that is, `54%` of the weight values are exactly `0`), the resulting weight tensor will be stored as a dense `const` op and not converted to the `constexpr_sparse_to_dense` op (which stores the weight values in a sparse format).
    Must be a value between `0` and `1`.
    Default value is `0.5`.
  - weight_threshold: int
    The size threshold, above which weights are pruned. That is, a weight tensor is pruned only if its total number of elements is greater than `weight_threshold`.
    For example, if `weight_threshold = 1024` and a weight tensor is of shape `[10, 20, 1, 1]`, hence `200` elements, it will not be pruned.
    If not provided, it defaults to `2048`, meaning that only weights with more than `2048` elements are compressed.
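To make the interplay of the three parameters concrete, here is a rough NumPy sketch of the decision logic (an approximation for illustration, not the coremltools implementation):

```python
import numpy as np


def threshold_prune(weight, threshold=1e-12,
                    minimum_sparsity_percentile=0.5, weight_threshold=2048):
    # Approximation of OpThresholdPrunerConfig's behavior, for illustration.
    if weight.size <= weight_threshold:
        return weight, False             # too small: left as a dense const op
    pruned = np.where(np.abs(weight) < threshold, 0.0, weight)
    sparsity = np.mean(pruned == 0)      # fraction of exactly-zero values
    if sparsity < minimum_sparsity_percentile:
        return weight, False             # not sparse enough: stays dense
    return pruned, True                  # stored via constexpr_sparse_to_dense


weight = np.random.randn(64, 64)         # 4096 elements > weight_threshold
pruned, is_sparse = threshold_prune(weight)
```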
- class coremltools.optimize.coreml.OpMagnitudePrunerConfig(target_sparsity: float | None = None, block_size: int | None = None, n_m_ratio: Tuple[int, int] | None = None, dim: int | None = None, weight_threshold: int | None = 2048)
Prune the weight with a constant sparsity percentile, which can be specified by either `target_sparsity` or `n_m_ratio`.

If `target_sparsity` is set, the `n` lowest-magnitude weight values are changed to `0`, where `n = floor(size_of_weight_tensor * target_sparsity)`. For example, given the following:

```python
weight = [0.3, -0.2, -0.01, 0.05]
target_sparsity = 0.75
```

The sparsified weight would be `[0.3, 0, 0, 0]`.
If `block_size` is set, weights are pruned in a block-structured manner; that is, chunks of weight values of size `block_size` are set to `0`. Block sparsity can only be applied to `linear` and `conv` layers. For example:

```python
# Given a 4 x 2 weight with the following values, and block_size = 2, dim = 0:
[
    [1,  3],
    [-6, -7],
    [0,  3],
    [-9,  2],
]

# First, flatten the matrix along axis = 0:
[1, -6, 0, -9, 3, -7, 3, 2]

# For block size 2, the L2 norm is computed over the first 2 elements,
# then the next 2 elements, and so on:
[6.08, 9.00, 7.62, 3.61]

# The blocks with the smallest norms are pruned. For target_sparsity = 0.5,
# the two blocks with L2 norms 6.08 and 3.61 (the first and the last) are
# pruned, resulting in the following flattened tensor:
[0, 0, 0, -9, 3, -7, 0, 0]

# The final pruned tensor is:
[
    [0,  3],
    [0, -7],
    [0,  0],
    [-9,  0],
]
```
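The same computation can be reproduced with a short NumPy sketch (illustrative only, not the coremltools internals; it assumes the weight size is divisible by `block_size`):

```python
import numpy as np


def block_prune(weight, block_size, dim, target_sparsity):
    # Illustrative block-magnitude pruning, not the coremltools internals.
    order = "F" if dim == 0 else "C"     # dim = 0 flattens column-major
    flat = weight.flatten(order=order)
    blocks = flat.reshape(-1, block_size)
    norms = np.linalg.norm(blocks, axis=1)        # L2 norm per block
    n_prune = int(len(norms) * target_sparsity)   # number of blocks to zero
    blocks[np.argsort(norms)[:n_prune]] = 0       # drop lowest-norm blocks
    return flat.reshape(weight.shape, order=order)


weight = np.array([[1, 3], [-6, -7], [0, 3], [-9, 2]], dtype=float)
print(block_prune(weight, block_size=2, dim=0, target_sparsity=0.5))
# [[ 0.  3.]
#  [ 0. -7.]
#  [ 0.  0.]
#  [-9.  0.]]
```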
The `n_m_ratio` triggers `n:m` pruning along the `dim` axis. In `n:m` pruning, out of every `m` elements, the `n` with the lowest magnitude are set to `0`. For more information, see Learning N:M Fine-Grained Structured Sparse Neural Networks From Scratch. `n:m` pruning can be applied only to `linear` and `conv` layers.

Example
```python
# Given a 4 x 4 weight of
[
    [3,  4,  7,  6],
    [1,  8, -3, -8],
    [-2, -3, -4,  0],
    [5,  4, -3, -2],
]

# For n_m_ratio = (1, 2) with axis = 1 (default), the resulting pruned weight is
[
    [0,  4,  7,  0],
    [0,  8,  0, -8],
    [0, -3, -4,  0],
    [5,  0, -3,  0],
]

# For axis = 0, we get
[
    [3,  0,  7,  0],
    [0,  8,  0, -8],
    [0,  0, -4,  0],
    [5,  4,  0, -2],
]
```
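A compact NumPy sketch of `n:m` pruning (illustrative only, assuming the size along `dim` is divisible by `m`):

```python
import numpy as np


def nm_prune(weight, n, m, dim=1):
    # Illustrative n:m pruning, not the coremltools internals.
    w = np.swapaxes(weight, dim, -1).copy()  # move `dim` to the last axis
    groups = w.reshape(-1, m)                # consecutive groups of m elements
    idx = np.argsort(np.abs(groups), axis=1)[:, :n]  # n smallest magnitudes
    np.put_along_axis(groups, idx, 0, axis=1)
    return np.swapaxes(groups.reshape(w.shape), -1, dim)


weight = np.array(
    [[3, 4, 7, 6], [1, 8, -3, -8], [-2, -3, -4, 0], [5, 4, -3, -2]],
    dtype=float,
)
print(nm_prune(weight, n=1, m=2, dim=1))  # matches the axis = 1 result above
```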
- Parameters:
  - target_sparsity: float
    The percentage of sparsity for compression, which needs to be in the range `[0, 1]`. When `0`, no sparsification occurs. For `1`, all weights become `0`.
  - block_size: int
    Block size for inducing block sparsity. This is applied along the `dim` dimension of the parameter. Having the zeros aligned in the parameter helps gain latency/memory performance on-device.
    If set, must be greater than `1` to enable block sparsity.
    Block sparsity can be applied only to `linear` and `conv` layers.
    The channel will be padded with `0` if it is not divisible by `block_size`.
  - n_m_ratio: tuple[int]
    A tuple of two integers that specifies the ratio for `n:m` pruning; `n` must be smaller than or equal to `m`.
    The channel will be padded with `0` if it is not divisible by `m`.
  - dim: int
    Dimension along which the block sparsity or `n:m` sparsity is applied.
    Must be either `0` or `1`.
    The default value for block sparsity is `0` (output channel); the default value for `n:m` sparsity is `1` (input channel).
  - weight_threshold: int
    The size threshold, above which weights are pruned. That is, a weight tensor is pruned only if its total number of elements is greater than `weight_threshold`.
    For example, if `weight_threshold = 1024` and a weight tensor is of shape `[10, 20, 1, 1]`, hence `200` elements, it will not be pruned.
    If not provided, it defaults to `2048`, meaning that only weights with more than `2048` elements are compressed.
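A usage sketch mirroring the `prune_weights` example above; the `op_type_configs` override for conv layers is an assumed illustration of `OptimizationConfig`'s per-op-type configuration:

```python
import coremltools as ct
import coremltools.optimize as cto

model = ct.models.MLModel("my_model.mlpackage")
config = cto.coreml.OptimizationConfig(
    # 75% of each large weight tensor is zeroed out by magnitude.
    global_config=cto.coreml.OpMagnitudePrunerConfig(target_sparsity=0.75),
    # Assumed illustration: conv layers use 2:4 structured sparsity instead.
    op_type_configs={
        "conv": cto.coreml.OpMagnitudePrunerConfig(n_m_ratio=(2, 4))
    },
)
pruned_model = cto.coreml.prune_weights(model, config)
```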