Performance#
Since quantization reduces the size of each weight value, the amount of data to be moved is reduced during prediction. This can lead to benefits with memory-bottlenecked models.
Quantizing the activations can ease this memory pressure further and may yield additional gains over weight-only quantization. However, activation quantization can cause a considerable inference slowdown on compute units (CPU and sometimes GPU) that decompress weights at load time: activations are not known at load time, so they must be decompressed at runtime. It is therefore recommended to use activation quantization only when your model runs fully or mostly on the Neural Engine (NE).
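To make the data-movement argument concrete, here is a minimal sketch of symmetric 8-bit weight quantization in plain Python. It is illustrative only and is not how Core ML Tools implements quantization; the helper names `quantize_int8` and `dequantize` and the sample weights are hypothetical.

```python
def quantize_int8(weights):
    """Symmetric linear quantization of a list of floats to int8 values."""
    max_abs = max(abs(w) for w in weights)
    # One float scale maps the range [-max_abs, max_abs] onto [-127, 127].
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float weights from int8 values and the scale."""
    return [q * scale for q in quantized]


# Hypothetical weights, chosen only for illustration.
weights = [0.02, -0.51, 0.33, 1.27, -0.90, 0.004]
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)

# Each weight now needs 1 byte instead of 2 (float16), plus one shared scale.
float16_bytes = 2 * len(weights)
int8_bytes = 1 * len(weights)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(quantized, scale)
print(f"bytes: {float16_bytes} -> {int8_bytes}, max abs error: {max_error:.4f}")
```

The rounding error per weight is bounded by half the scale, which is why the accuracy loss from 8-bit weights is often small while the bytes moved per weight are halved relative to float16.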
On newer hardware with A17 Pro or M4 chips, such as iPhone 15 Pro, the Neural Engine offers higher throughput for int8-int8 compute than previous generations. This means that quantizing both activations and weights for networks running on the Neural Engine can give even larger latency gains. This can be seen in the table below (e.g., the ResNet50 model in W8A8 mode runs considerably faster than its W16A16 equivalent).
The per-block weight quantization option added in iOS 18/macOS 15 is especially useful when quantizing to 4 bits. With it, one can expect large runtime memory gains and, depending on the model, latency gains when the model runs on the GPU. If the model runs on the NE, on the other hand, it is recommended to use the per-channel scales option.
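The difference between per-channel and per-block scales can be sketched in plain Python as follows. This is a toy illustration under simplifying assumptions (symmetric quantization, a single weight row, hypothetical helper names and values), not the Core ML Tools implementation.

```python
def quantize_dequantize(values, bits=4):
    """Symmetric quantize-then-dequantize of `values` with a single scale."""
    qmax = 2 ** (bits - 1) - 1  # 7 for 4-bit
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in values]


def per_channel(row, bits=4):
    """One scale for the whole output channel (row)."""
    return quantize_dequantize(row, bits)


def per_block(row, block_size, bits=4):
    """One scale per contiguous block of `block_size` weights in the row."""
    out = []
    for start in range(0, len(row), block_size):
        out.extend(quantize_dequantize(row[start:start + block_size], bits))
    return out


def mean_abs_error(a, b):
    """Average absolute difference between two equally long lists."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


# Hypothetical row: small weights plus one large outlier in the last block.
row = [0.01, -0.02, 0.03, -0.04, 0.02, -0.01, 0.04, 2.0]

err_channel = mean_abs_error(row, per_channel(row))
err_block = mean_abs_error(row, per_block(row, block_size=4))
print(f"per-channel error: {err_channel:.4f}, per-block error: {err_block:.4f}")
```

With only 15 usable levels at 4 bits, a single outlier stretches the per-channel scale so far that the small weights round to zero. Per-block scales confine that damage to the outlier's block, at the cost of storing more scale values.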
Performance Benchmarks:#
Methodology:#
Latency numbers were captured using the Xcode Performance tab and report the median statistic. Compute unit selection is set to all unless otherwise noted. The numbers are sensitive to device state and may vary across build versions.
Device: iPhone 14 Pro (A16), unless otherwise mentioned
iOS build: iOS 17
Xcode: Xcode 15
For more details on base models and compression methodology, please refer to docs here.
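The numbers in this section come from the Xcode Performance tab, not from a script. Purely as an illustration of why a median is a reasonable statistic for noisy latency samples, here is a small sketch; `median_latency_ms` and `dummy_predict` are hypothetical names, and the dummy workload stands in for a real prediction call.

```python
import statistics
import time


def median_latency_ms(fn, warmup=3, runs=25):
    """Return the median wall-clock latency of `fn()` in milliseconds."""
    for _ in range(warmup):
        fn()  # warm-up calls are not timed
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)


def dummy_predict():
    """Hypothetical stand-in for a model prediction call."""
    return sum(i * i for i in range(10_000))


latency = median_latency_ms(dummy_predict)
print(f"median latency: {latency:.3f} ms")
```

Unlike the mean, the median is not pulled upward by an occasional slow run caused by transient device state, which makes it better suited to comparing configurations.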
Model Info#
| Model Name | Task | Pre-trained Weights | Dataset | Accuracy Metric |
|---|---|---|---|---|
| MobileNetv2-1.0 | Image Classification | | | Top-1 Accuracy (%) |
| ResNet50 | Image Classification | | | Top-1 Accuracy (%) |
| MobileViTv2-1.0 | Image Classification | cvnets | | Top-1 Accuracy (%) |
Results#
| Model Name | Config | Optimization Workflow | Compression Ratio | Accuracy | Latency in ms (per batch) on iPhone 14 Pro | Latency in ms (per batch) on iPhone 15 Pro |
|---|---|---|---|---|---|---|
| | Float16 | n/a | 1.0 | 71.86 | 0.48 | 0.49 |
| | Weight-only | Post Training | 1.92 | 71.78 | 0.45 | 0.44 |
| | | Training Time | 1.92 | 71.66 ± 0.04 | 0.27 | 0.20 |
| | Float16 | n/a | 1.0 | 76.14 | 1.52 | 1.38 |
| | Weight-only | Post Training | 1.99 | 76.10 | 1.49 | 1.50 |
| | | Training Time | 1.98 | 76.80 ± 0.05 | 0.94 | 0.77 |
| | Float16 | n/a | 1.0 | 78.09 | 1.38 | 1.36 |
| | Weight-only | Post Training | 1.92 | 77.66 | 1.43 | 1.37 |
| | | Training Time | 1.89 | 76.89 ± 0.07 | 1.18 | 1.03 |
Note: The trained and compressed models and the coremltools.optimize.torch config files used for compression can be downloaded by clicking the respective links embedded in the model and config names.
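As a reading aid, the latency figures above can be turned into speedup factors relative to Float16. The sketch below copies the latencies from the Results table. It assumes that the three groups of rows follow the order of the Model Info table (MobileNetv2-1.0, ResNet50, MobileViTv2-1.0) and that the Training Time rows are the weight-and-activation (W8A8) configurations referred to in the text. Both are assumptions, since those cells are not legible in the table.

```python
# Latency in ms per batch as (iPhone 14 Pro, iPhone 15 Pro), copied from the
# Results table. ASSUMPTION: the model labels follow the Model Info row order.
results = {
    "MobileNetv2-1.0": {"float16": (0.48, 0.49), "training_time": (0.27, 0.20)},
    "ResNet50": {"float16": (1.52, 1.38), "training_time": (0.94, 0.77)},
    "MobileViTv2-1.0": {"float16": (1.38, 1.36), "training_time": (1.18, 1.03)},
}


def speedups(entry):
    """Float16 latency divided by Training Time latency, per device."""
    return tuple(
        round(base / quant, 2)
        for base, quant in zip(entry["float16"], entry["training_time"])
    )


speedup_by_model = {name: speedups(entry) for name, entry in results.items()}
for name, (iphone14, iphone15) in speedup_by_model.items():
    print(f"{name}: {iphone14}x on iPhone 14 Pro, {iphone15}x on iPhone 15 Pro")
```

In every group the speedup is larger on iPhone 15 Pro than on iPhone 14 Pro, which is consistent with the higher int8-int8 throughput described for A17 Pro.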