Performance#
Compared to storing weights in a dense format with float16 precision, sparse representation saves about two bytes of storage for every zero value. Model size goes down linearly with the level of sparsity introduced.
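As a back-of-the-envelope check (the layer size below is illustrative, not from the benchmarks), the storage of a dense float16 layer versus its sparsely stored counterpart can be estimated as:

```python
def sparse_size_bytes(num_weights: int, sparsity: float, bytes_per_weight: int = 2) -> int:
    """Rough storage estimate for a sparsely stored weight tensor:
    only non-zero values are kept (metadata such as the bitmask of
    zero locations is ignored in this simple estimate)."""
    nonzeros = int(num_weights * (1.0 - sparsity))
    return nonzeros * bytes_per_weight

num_weights = 1_000_000           # e.g. a hypothetical 1M-parameter layer
dense = num_weights * 2           # float16: 2 bytes per weight
print(dense, sparse_size_bytes(num_weights, 0.5), sparse_size_bytes(num_weights, 0.75))
# 2000000 1000000 500000
```

Consistent with the linear relationship above, 50% sparsity roughly halves the weight storage and 75% sparsity roughly quarters it.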
For a model that primarily runs on the Neural Engine, sparsity typically helps improve latency. This is made possible by two factors:

- It reduces the size of the weights to be loaded at inference time, which can speed up inference of networks that are weight-memory bound.
- When a string of consecutive `0`s is encountered, the Neural Engine may also be able to skip computations, thereby reducing the amount of computation. This can be achieved by choosing higher levels of unstructured sparsity (e.g. 75% or higher) or block-structured sparsity, where zeros occur in blocks of 2 or its multiples. Note that longer fine-tuning with more data is usually needed to preserve accuracy with larger block sizes and higher levels of sparsity.
Models with a lot of linear ops can benefit from inference speed-ups on CPU on newer hardware generations when using `n:m` sparsity, in which `n` elements out of every block of `m` consecutive elements are `0`s. `m` should be a factor of 16 (as in `3:4`, `7:8`, `14:16`, and so on) and `n/m >= 0.5`.

Pruning can be applied jointly with quantization and palettization to achieve additional latency and memory savings, over and above those achieved by applying those techniques individually.
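The `n:m` pattern can be sketched in the same way: in every block of `m` consecutive weights, the `n` smallest-magnitude entries are zeroed. The helper below is a simplified illustration (it assumes the last axis is divisible by `m`), not the exact algorithm used by any particular pruner.

```python
import numpy as np

def n_m_sparsity_mask(w: np.ndarray, n: int, m: int) -> np.ndarray:
    """In every block of m consecutive weights (along the last axis),
    zero out the n smallest-magnitude entries."""
    blocks = np.abs(w).reshape(-1, m)
    mask = np.ones_like(blocks)
    # indices of the n smallest |w| in each block -> set to zero
    smallest = np.argsort(blocks, axis=1)[:, :n]
    np.put_along_axis(mask, smallest, 0.0, axis=1)
    return mask.reshape(w.shape)

rng = np.random.default_rng(0)
w = rng.standard_normal((2, 16))
mask = n_m_sparsity_mask(w, n=3, m=4)   # 3:4 sparsity -> 75% zeros overall
print((mask == 0).mean())               # -> 0.75
```

Note how a `3:4` pattern yields the same overall sparsity level as 75% unstructured sparsity, but with a regular structure that the hardware can exploit.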
Performance Benchmarks#
Methodology#
The latency numbers were captured using the Xcode Performance tab, using the `median` statistic. Compute unit selection is `all` unless otherwise noted. The latency numbers are sensitive to the device state, and may vary across runs and build versions.
- Device: iPhone 14 Pro (A16), unless otherwise mentioned
- iOS build: iOS 17
- Xcode: Xcode 15
For more details on base models and compression methodology, please refer to docs here.
Results#
| Model Name | Config | Optimization Algorithm | Compression Ratio | Latency in ms (per batch) |
| --- | --- | --- | --- | --- |
|  | Float16 | n/a | 1.0 | 0.48 |
|  | Unstructured Sparsity 50% | MagnitudePruner | 1.37 | 0.46 |
|  | Unstructured Sparsity 75% | MagnitudePruner | 1.73 | 0.46 |
|  | Float16 | n/a | 1.0 | 0.13 |
|  |  | MagnitudePruner | 1.73 | 0.12 |
|  |  | MagnitudePruner | 3.06 | 0.12 |
|  | Float16 | n/a | 1.0 | 1.52 |
|  |  | MagnitudePruner | 1.77 | 1.46 |
|  |  | MagnitudePruner | 3.17 | 1.28 |
Note: The trained and compressed models and the `coremltools.optimize.torch` config files used for compression can be downloaded by clicking the respective links embedded in the model and config names.