API Overview#

Pruning APIs for Core ML models#

Data-free Pruning#

Here is a simple example showing the usage of OpThresholdPrunerConfig:

from coremltools.optimize.coreml import (
    OpThresholdPrunerConfig,
    OptimizationConfig,
    prune_weights,
)

config = OptimizationConfig(global_config=OpThresholdPrunerConfig(
  threshold=0.03
))
model_compressed = prune_weights(model, config=config)
  • threshold: All weight values whose magnitude falls below this value are set to zero.

Another way to perform data-free pruning is to use OpMagnitudePrunerConfig. Below, we see how to configure it with different config parameters based on the op type and op name:

from coremltools.optimize.coreml import (
    OpMagnitudePrunerConfig,
    OptimizationConfig,
    prune_weights,
)

global_config = OpMagnitudePrunerConfig(
    target_sparsity=0.5,
    weight_threshold=1024,
)
linear_config = OpMagnitudePrunerConfig(target_sparsity=0.75)
config = OptimizationConfig(
  global_config=global_config,
  op_type_configs={"linear": linear_config},
  op_name_configs={"fc": None}
)
model_compressed = prune_weights(model, config=config)
  • target_sparsity: The lowest-magnitude weight values, up to the fraction specified by target_sparsity, are set to zero.

  • weight_threshold: Only weight tensors whose size (number of elements) is greater than weight_threshold are pruned.

  • Structured sparsity, such as block-structured or n:m structured, can be applied using the block_size and n_m_ratio parameters, respectively (see the sketch after this list).

  • op_type_configs and op_name_configs: Configure the ops at a more fine-grained level. Here all linear layers are configured with 75% sparsity, the fc layer is skipped, and the remaining layers are pruned to 50% sparsity.

  • get_weights_metadata(): Utility that provides detailed information about all the weights in the Core ML model, which can be used to find the names of the ops to customize.
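
To make the structured-sparsity options and the get_weights_metadata() utility above more concrete, here is a hedged sketch; the op types, thresholds, and the iteration over the returned metadata are illustrative and may need adjusting against the coremltools API reference:

from coremltools.optimize.coreml import (
    OpMagnitudePrunerConfig,
    OptimizationConfig,
    get_weights_metadata,
    prune_weights,
)

# Inspect the model's weights to find the names of ops worth customizing
weights_metadata = get_weights_metadata(model, weight_threshold=1024)
for weight_name in weights_metadata:
    print(weight_name)

# Block-structured sparsity: weights are pruned in contiguous blocks of size 2
block_config = OpMagnitudePrunerConfig(target_sparsity=0.5, block_size=2)

# n:m structured sparsity: 3 out of every 4 consecutive values with the lowest
# magnitude are set to zero
n_m_config = OpMagnitudePrunerConfig(n_m_ratio=(3, 4))

config = OptimizationConfig(
    global_config=n_m_config,
    op_type_configs={"conv": block_config},
)
model_compressed = prune_weights(model, config=config)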

Pruning APIs for Torch models#

Calibration-data-based Pruning (SparseGPT)#

The following example shows how to compress a model using SparseGPT and LayerwiseCompressor. Here the pruning config is provided via a YAML file; across all APIs, configs can be specified either in code via a dictionary or via YAML files (a dictionary-based equivalent is sketched after the example below).

sparse_gpt_config.yaml:

algorithm: "sparsegpt"
layers:
  - 'model.layer\d+'
global_config:
  target_sparsity: 0.5
calibration_nsamples: 125

from coremltools.optimize.torch import (
    LayerwiseCompressor,
    LayerwiseCompressorConfig,
)

config = LayerwiseCompressorConfig.from_yaml("sparse_gpt_config.yaml")

compressor = LayerwiseCompressor(model, config)
model = compressor.compress(dataloader=dataloader, device=torch.device("cuda"))
  • algorithm: Set to "sparsegpt" to select the SparseGPT algorithm in the LayerwiseCompressor.

  • target_sparsity: Refers to the amount of sparsity to apply for each layer’s weight tensor.

  • layers: Layers to be pruned. This is a list of fully qualified layer (module) names or regular expressions matching layer names.

  • weight_dtype, quantization_granularity and quantization_scheme can be configured to quantize the non-zero weights for further compression.

  • n:m structured sparsity can be set through the n_m_ratio option.

  • The compress method takes in a dataloader for the calibration dataset as well as the device for performing computation. The dataloader is an iterable of the inputs that need to be fed into the model.
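
As noted earlier, the same configuration could also be built in code rather than from a YAML file, and the calibration dataloader can be any iterable of model inputs. The sketch below assumes that LayerwiseCompressorConfig.from_dict accepts the same keys as the YAML file and uses random token ids as placeholder calibration data; the sequence length and vocabulary size are hypothetical:

import torch

from coremltools.optimize.torch import (
    LayerwiseCompressor,
    LayerwiseCompressorConfig,
)

# The same settings as sparse_gpt_config.yaml, built as a dictionary
config = LayerwiseCompressorConfig.from_dict({
    "algorithm": "sparsegpt",
    "layers": [r"model.layer\d+"],
    "global_config": {"target_sparsity": 0.5},
    "calibration_nsamples": 125,
})

# A calibration dataloader is simply an iterable of inputs that can be fed to
# the model; here random token ids stand in for real calibration samples
def dataloader(nsamples=125, seq_len=128, vocab_size=32000):
    for _ in range(nsamples):
        yield torch.randint(0, vocab_size, (1, seq_len))

compressor = LayerwiseCompressor(model, config)
model = compressor.compress(dataloader=dataloader(), device=torch.device("cuda"))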

Data-free Pruning#

As mentioned in the previous Pruning Algorithms section, the MagnitudePruner can be used to perform data-free pruning to experiment with different pruning structures. In the example below, n:m structured pruning (with a ratio of 6:8) is applied to the model.

from coremltools.optimize.torch import (
  MagnitudePrunerConfig,
  MagnitudePruner,
)

config = MagnitudePrunerConfig.from_dict({
  "global_config": {
    "n_m_ratio": [6, 8]
  }
})
pruner = MagnitudePruner(model, config)

model = pruner.prepare()
# Apply the pruning mask once; no fine-tuning is performed in the data-free flow
pruner.step()
model = pruner.finalize()

Here the ConstantSparsityScheduler is used by default, which allows the model to be pruned in a data-free manner.
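
Since finalize() commits the zeros into the weight tensors, the achieved sparsity can be checked directly after it returns; a minimal sketch, with a hypothetical layer name:

# The layer name fc1 is hypothetical; substitute a layer from your own model
weight = model.fc1.weight.detach()
sparsity = (weight == 0).float().mean().item()
print(f"fraction of zeros: {sparsity:.2f}")  # ~0.75 if 6 of every 8 values are zeroed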

Training time Pruning#

The MagnitudePruner can be used to introduce sparsity while fine-tuning, so that the model can adapt to the accuracy loss caused by sparsification. In the example below, 75% sparsity is applied to all convolution layers of the model in a gradual, incremental manner.

from coremltools.optimize.torch import (
  MagnitudePrunerConfig,
  MagnitudePruner,
)

config = MagnitudePrunerConfig.from_dict({
    "module_type_configs": {
      "Conv2d": {
        "scheduler": {"update_steps": "range(0, 100, 5)"},
        "target_sparsity": 0.75,
        "granularity": "per_scalar",
      },
    }
})

pruner = MagnitudePruner(model, config)

model = pruner.prepare()


for epoch in range(num_epochs):
  for inp, label in train_dataloader:
    train_step(inp, label)
    pruner.step()

pruner.finalize(inplace=True)
  • target_sparsity: Refers to the final level of sparsity the model will have at the end of training.

  • granularity: One of per_scalar, per_kernel, or per_channel; allows for different ways of structuring the sparsity within the weight tensor.

  • The scheduler (PolynomialDecayScheduler in the example above) incrementally adds sparsity over the course of training so that the weights can adapt to its introduction. The update_steps parameter specifies the training steps at which the sparsity level is updated. In this example, the sparsity is updated every five steps, from step 0 up to step 100.

  • MagnitudePruner.prepare inserts the pruning layers and hooks into the model.

  • MagnitudePruner.step incrementally adds sparsity based on the sparsity schedule described by the scheduler.

  • MagnitudePruner.finalize commits all the changes to the model by setting the pruned weight values to zero.
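
For completeness, the train_step call in the training loop above stands in for a regular training iteration; a minimal sketch, assuming a classification model and an SGD optimizer (both hypothetical):

import torch
import torch.nn.functional as F

optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

def train_step(inp, label):
    # A standard forward/backward pass; pruner.step() is called separately in the loop
    optimizer.zero_grad()
    loss = F.cross_entropy(model(inp), label)
    loss.backward()
    optimizer.step()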

Converting Torch models to Core ML#

If the Torch model already contains weights that have been zeroed out but are still stored in a dense representation, the ct.optimize.coreml APIs mentioned above can be used to generate a Core ML model with a sparse weight representation. If the Torch model was pruned using the ct.optimize.torch APIs mentioned above, then simply calling ct.convert is sufficient to generate the sparse Core ML model.
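
For the second path, a minimal conversion sketch, where the example input shape, deployment target, and file name are placeholders:

import coremltools as ct
import torch

# Trace the pruned Torch model with a representative input
example_input = torch.rand(1, 3, 224, 224)
traced_model = torch.jit.trace(model, example_input)

# Convert to an ML program; the sparse weight representation requires a recent
# minimum deployment target
mlmodel = ct.convert(
    traced_model,
    inputs=[ct.TensorType(shape=example_input.shape)],
    minimum_deployment_target=ct.target.iOS17,
)
mlmodel.save("pruned_model.mlpackage")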

For more details, refer to PyTorch Conversion Workflow.