API Overview#
Pruning APIs for Core ML model#
OpMagnitudePrunerConfig: Prunes the weights with a constant sparsity percentile.
OpThresholdPrunerConfig: Sets all weight values below a certain value to zero.
Data free Pruning#
Here is a simple example showing the usage of OpThresholdPrunerConfig:
from coremltools.optimize.coreml import (
    OpThresholdPrunerConfig,
    OptimizationConfig,
    prune_weights,
)

config = OptimizationConfig(global_config=OpThresholdPrunerConfig(
    threshold=0.03
))
model_compressed = prune_weights(model, config=config)
All weight values below the value specified by threshold are set to zero.
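To make the effect of threshold concrete, here is a minimal pure-Python sketch of threshold pruning on a flat list of weights. It is an illustration, not the coremltools implementation: the helper names threshold_prune and sparsity are hypothetical, and the sketch assumes the comparison is made on the absolute value of each weight and is inclusive.

```python
def threshold_prune(weights, threshold):
    """Return a copy of `weights` with small-magnitude entries set to zero.

    Assumption: entries are compared by absolute value, inclusively (<=).
    """
    return [0.0 if abs(w) <= threshold else w for w in weights]


def sparsity(weights):
    """Fraction of entries that are exactly zero."""
    return sum(1 for w in weights if w == 0.0) / len(weights)


weights = [0.5, -0.01, 0.02, -0.4, 0.03, 0.2]
pruned = threshold_prune(weights, threshold=0.03)
print(pruned)
print(sparsity(pruned))
```

Unlike a target sparsity, a threshold does not fix how many weights are removed: the resulting sparsity depends entirely on the distribution of the weight values.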
Another way to perform data-free pruning is to use OpMagnitudePrunerConfig. The example below configures it with different parameters based on the op type and op name:
from coremltools.optimize.coreml import (
    OpMagnitudePrunerConfig,
    OptimizationConfig,
    prune_weights,
)

# Default for all ops: 50% sparsity, applied only to tensors with more than 1024 elements.
global_config = OpMagnitudePrunerConfig(
    target_sparsity=0.5,
    weight_threshold=1024,
)
# Override for every op of type "linear".
linear_config = OpMagnitudePrunerConfig(target_sparsity=0.75)
config = OptimizationConfig(
    global_config=global_config,
    op_type_configs={"linear": linear_config},
    op_name_configs={"fc": None},  # None skips pruning for the op named "fc"
)
model_compressed = prune_weights(model, config=config)
target_sparsity: The lowest-magnitude values, up to a fraction target_sparsity of the weight tensor, are set to zero.
weight_threshold: Only weight tensors with more than weight_threshold elements are pruned.
Structured sparsity, such as block-structured or n:m structured sparsity, can be applied using block_size and n_m_ratio respectively.
op_type_configs and op_name_configs can be used to configure the modules at a more fine-grained level. Here all linear layers are configured with 75% sparsity, the fc layer is skipped, and the remaining layers are pruned to 50% sparsity.
The get_weights_metadata() utility provides detailed information about all the weights in the Core ML model, which can be used to find the names of the ops to customize.
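The interaction between target_sparsity and weight_threshold can be sketched in a few lines of pure Python. This is an illustration under stated assumptions, not the library's code: magnitude_prune is a hypothetical helper, the number of pruned elements is rounded down, and the size check is assumed to be a strict "greater than".

```python
def magnitude_prune(weights, target_sparsity, weight_threshold=0):
    """Zero the lowest-magnitude fraction of a flat weight list.

    Tensors with `weight_threshold` elements or fewer are returned unchanged
    (assumption: only tensors strictly larger than the threshold are pruned).
    """
    if len(weights) <= weight_threshold:
        return list(weights)
    num_to_prune = int(target_sparsity * len(weights))  # assumption: round down
    # Indices sorted by ascending magnitude; the first ones are pruned.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned_idx = set(order[:num_to_prune])
    return [0.0 if i in pruned_idx else w for i, w in enumerate(weights)]


weights = [0.9, -0.1, 0.3, -0.7, 0.05, 0.6, -0.2, 0.4]
half = magnitude_prune(weights, target_sparsity=0.5)
skipped = magnitude_prune(weights, target_sparsity=0.5, weight_threshold=1024)
print(half)
print(skipped == weights)
```

With weight_threshold=1024 the 8-element tensor is left untouched, which mirrors how small weight tensors are excluded from pruning in the configuration above.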
Pruning APIs for Torch model#
SparseGPT using LayerwiseCompressor: A post-training, calibration-data-based compression algorithm based on the paper SparseGPT: Massive Language Models Can be Accurately Pruned in One-Shot.
MagnitudePruner: A weight-norm-guided pruning algorithm based on the paper To prune or not to prune.
Calibration data based Pruning (SparseGPT)#
The following example shows how to compress a model using SparseGPT and LayerwiseCompressor. Here the pruning config is provided in a yaml file. Across all APIs, configs can be provided either in code via a dictionary structure or via yaml files.
sparse_gpt_config.yaml:
algorithm: "sparsegpt"
layers:
  - 'model.layer\d+'
global_config:
  target_sparsity: 0.5
calibration_nsamples: 125
import torch

from coremltools.optimize.torch import (
    LayerwiseCompressor,
    LayerwiseCompressorConfig,
)

config = LayerwiseCompressorConfig.from_yaml("sparse_gpt_config.yaml")
compressor = LayerwiseCompressor(model, config)
# dataloader yields calibration inputs; computation runs on the given device.
model = compressor.compress(dataloader=dataloader, device=torch.device("cuda"))
algorithm: Set to "sparsegpt" to select SparseGPT as the algorithm run by LayerwiseCompressor.
target_sparsity: The amount of sparsity to apply to each layer's weight tensor.
layers: The layers to be pruned. This is a list of either fully-qualified layer (module) names or regexes for the layer names.
weight_dtype, quantization_granularity and quantization_scheme can be configured to quantize the non-zero weights for further compression.
n:m structured sparsity can be set through the n_m_ratio option.
The compress method takes a dataloader for the calibration dataset as well as the device on which to perform computation. The dataloader is an iterable of the inputs that need to be fed to the model.
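Because layers accepts regexes, it can help to check which module names a pattern actually selects before running compression. The sketch below uses only Python's re module; select_layers and the example module names are hypothetical, and the sketch assumes a full match against the module name, which may differ from how the library matches internally.

```python
import re


def select_layers(layer_names, patterns):
    """Return the names that fully match at least one regex pattern."""
    compiled = [re.compile(p) for p in patterns]
    return [n for n in layer_names if any(c.fullmatch(n) for c in compiled)]


# Hypothetical module names, for illustration only.
names = [
    "model.layer0",
    "model.layer12",
    "model.layer3.attn",
    "model.embed",
    "modelXlayer7",
]
selected = select_layers(names, [r"model.layer\d+"])
print(selected)
```

Note that the unescaped `.` in the pattern matches any character, so a name such as modelXlayer7 is also selected here; writing `model\.layer\d+` restricts the match to a literal dot.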
Data free Pruning#
As mentioned in the previous Pruning Algorithms section, the MagnitudePruner can be used to perform data-free pruning to experiment with different pruning structures. In the example below, n:m structured pruning (with a ratio of 6:8) is applied to the model.
from coremltools.optimize.torch import (
    MagnitudePrunerConfig,
    MagnitudePruner,
)

config = MagnitudePrunerConfig.from_dict({
    "global_config": {
        "n_m_ratio": [6, 8]
    }
})
pruner = MagnitudePruner(model, config)
pruner.prepare()
model = pruner.finalize()
Here the ConstantSparsityScheduler is used (by default) to prune the model in a data-free manner.
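The n:m pattern itself can be illustrated without any framework. The sketch below assumes that, for a ratio n:m, the n lowest-magnitude values in every consecutive group of m values are set to zero, so a 6:8 ratio leaves 2 non-zero values per group of 8. The helper n_m_prune is hypothetical and works on a flat list, whereas a real pruner applies the pattern along a specific tensor dimension.

```python
def n_m_prune(weights, n, m):
    """In every consecutive group of `m` values, zero the `n` smallest magnitudes.

    Assumption: `n` counts the zeroed entries, not the kept ones.
    """
    if len(weights) % m != 0:
        raise ValueError("length of weights must be a multiple of m")
    out = list(weights)
    for start in range(0, len(weights), m):
        group = range(start, start + m)
        smallest = sorted(group, key=lambda i: abs(weights[i]))[:n]
        for i in smallest:
            out[i] = 0.0
    return out


weights = [0.8, -0.1, 0.3, 0.05, -0.9, 0.2, 0.4, -0.6,
           0.1, 0.7, -0.3, 0.5, 0.02, -0.8, 0.6, 0.25]
pruned = n_m_prune(weights, n=6, m=8)
print(pruned)
print(pruned.count(0.0) / len(pruned))
```

Under this reading, a 6:8 ratio always yields 75% sparsity, with the non-zero values spread evenly across groups rather than concentrated wherever the largest weights happen to be.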
Training time Pruning#
The MagnitudePruner can be used to introduce sparsity while fine-tuning the model, so that the model can adapt to the accuracy loss caused by sparsification. In the example below, 75% sparsity is applied gradually to all convolution (Conv2d) layers of the model.
from coremltools.optimize.torch import (
    MagnitudePrunerConfig,
    MagnitudePruner,
)

config = MagnitudePrunerConfig.from_dict({
    "module_type_configs": {
        "Conv2d": {
            "scheduler": {"update_steps": "range(0, 100, 5)"},
            "target_sparsity": 0.75,
            "granularity": "per_scalar",
        },
    }
})
pruner = MagnitudePruner(model, config)
model = pruner.prepare()
for epoch in range(num_epochs):
    for inp, label in train_dataloader:
        train_step(inp, label)
        pruner.step()  # advance the sparsity schedule after each training step
pruner.finalize(inplace=True)
target_sparsity: The amount of sparsity that the model will finally have.
granularity: One of per_scalar, per_kernel or per_channel. It selects how the sparsity is structured in the weight tensor.
The scheduler (the example above uses the PolynomialDecayScheduler) incrementally adds sparsity over the course of training so that the weights can adapt to the introduction of sparsity. The update_steps parameter refers to the training steps at which sparsity is introduced. In this example, sparsity is updated every 5 steps, starting at step 0 and continuing up to (but not including) step 100.
MagnitudePruner.prepare inserts the pruning layers and hooks into the model.
MagnitudePruner.step incrementally adds sparsity based on the sparsity schedule described by the scheduler.
MagnitudePruner.finalize commits all the changes to the model by replacing the pruned weights with zeros.
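To see how a polynomial decay schedule ramps sparsity across update_steps, the sketch below evaluates the cubic schedule from the To prune or not to prune paper in pure Python. It is an assumption-laden illustration: polynomial_decay_sparsity is a hypothetical helper, the initial sparsity of 0 and the power of 3 are assumed defaults, and the library's scheduler may differ in detail.

```python
def polynomial_decay_sparsity(step, update_steps, target_sparsity,
                              initial_sparsity=0.0, power=3):
    """Sparsity in effect at training `step` under a polynomial decay schedule.

    Sparsity only changes at the steps listed in `update_steps`; between
    updates, the value from the most recent update step is kept.
    """
    first, last = update_steps[0], update_steps[-1]
    if step < first:
        return initial_sparsity
    if last == first:
        return target_sparsity
    current = max(s for s in update_steps if s <= step)
    progress = (current - first) / (last - first)
    return target_sparsity + (initial_sparsity - target_sparsity) * (1 - progress) ** power


update_steps = range(0, 100, 5)  # 0, 5, ..., 95
for step in (0, 5, 25, 50, 95):
    print(step, round(polynomial_decay_sparsity(step, update_steps, 0.75), 4))
```

The schedule adds sparsity quickly in the early updates, when many low-magnitude weights can be removed cheaply, and slows down as it approaches target_sparsity so the remaining weights have time to adapt.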
Converting Torch models to Core ML#
If the Torch model already contains weights that have been zeroed out but are still in a dense representation, the ct.optimize.coreml APIs mentioned above can be used to generate a Core ML model with a sparse representation. If the Torch model was pruned using the ct.optimize.torch APIs mentioned above, then simply calling ct.convert should be sufficient to generate the sparse Core ML model.
For more details, refer to the PyTorch Conversion Workflow page.