Performance#

Since quantization reduces the size of each weight value, the amount of data to be moved is reduced during prediction. This can lead to benefits with memory-bottlenecked models.
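The size reduction described above can be sketched in a few lines of plain Python. This is a minimal illustration of symmetric 8-bit linear quantization with a single shared scale, not the implementation used by coremltools; the function names and the sample weights are hypothetical.

```python
def quantize_symmetric_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the stored integers."""
    return [q * scale for q in quantized]


def weight_bytes(num_weights, bits_per_weight):
    """Storage needed for the weight values alone (scales not counted)."""
    return num_weights * bits_per_weight / 8


# Hypothetical weights, chosen only for illustration.
weights = [0.42, -1.27, 0.05, 0.9]
q, scale = quantize_symmetric_int8(weights)
restored = dequantize(q, scale)

# Rounding error per weight is bounded by half a quantization step.
max_error = max(abs(a - b) for a, b in zip(weights, restored))

# Going from 16-bit to 8-bit storage halves the bytes to be moved.
ratio = weight_bytes(len(weights), 16) / weight_bytes(len(weights), 8)
```

The ratio computed here ignores the storage needed for the scales themselves, which is one reason a measured compression ratio can fall slightly below 2.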

Quantizing the activations as well may further ease this memory pressure and yield additional gains over weight-only quantization. However, activation quantization can cause a considerable slowdown on compute units that employ load-time weight decompression (the CPU and sometimes the GPU). Unlike weights, activations are not known at load time, so they have to be decompressed at runtime, which slows down inference. It is therefore recommended to use activation quantization only when your model runs fully or mostly on the Neural Engine (NE).

On newer hardware with A17 Pro or M4 chips, such as the iPhone 15 Pro, the Neural Engine offers higher throughput for int8-int8 compute than previous generations. Quantizing both activations and weights for networks running on the Neural Engine can therefore give even larger latency gains. This can be seen in the table below: for example, the ResNet50 model in W8A8 mode (the "Weight & activation" row) runs considerably faster than its W16A16 equivalent (the "Float16" row).
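To make "int8-int8 compute" concrete, the following is a conceptual sketch of a W8A8 dot product in plain Python: both operands are quantized to 8-bit integers, the products are accumulated as integers, and the result is rescaled to float once at the end. It is not a description of how the Neural Engine implements this. All names and values are hypothetical, and the activation scale is assumed to come from a small set of calibration samples, since activations are not known ahead of time.

```python
def symmetric_scale(values, qmax=127):
    """Scale that maps the largest magnitude in `values` to `qmax`."""
    max_abs = max(abs(v) for v in values)
    return max_abs / qmax if max_abs > 0 else 1.0


def quantize(values, scale, qmax=127):
    """Round to the nearest integer step and clamp to [-qmax, qmax]."""
    return [max(-qmax, min(qmax, round(v / scale))) for v in values]


def int8_dot(weights, activations, weight_scale, activation_scale):
    """Dot product computed on integers, rescaled to float at the end."""
    q_w = quantize(weights, weight_scale)
    q_a = quantize(activations, activation_scale)
    accumulator = sum(w * a for w, a in zip(q_w, q_a))  # integer arithmetic
    return accumulator * weight_scale * activation_scale


# Hypothetical weights and calibration activations.
weights = [0.5, -0.25, 0.75, 0.1]
calibration_activations = [[1.0, 2.0, -0.5, 0.3], [0.8, -1.9, 0.4, 0.0]]

weight_scale = symmetric_scale(weights)
activation_scale = symmetric_scale(
    [v for sample in calibration_activations for v in sample]
)

# A new input seen at prediction time.
x = [0.9, 1.5, -0.2, 0.1]
exact = sum(w * a for w, a in zip(weights, x))
approx = int8_dot(weights, x, weight_scale, activation_scale)
```

The weights can be quantized once ahead of time, whereas the input `x` has to be quantized on every prediction, which is the runtime cost that matters on compute units without native int8 support.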

The per-block weight quantization option added in iOS18/macOS15 is especially useful when quantizing to 4 bits. When the model runs on the GPU, it can deliver substantial runtime memory savings and, depending on the model, latency gains as well. If the model runs on the NE, on the other hand, the per-channel scales option is recommended.
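The difference between per-channel and per-block scales can be illustrated with a small plain-Python sketch. It quantizes one hypothetical weight row to 4 bits (using the symmetric integer range [-7, 7] as a simplifying assumption), once with a single scale for the whole row and once with one scale per block of 4 values, and compares the reconstruction error. The function names are illustrative and are not coremltools APIs.

```python
def quantize_dequantize(values, qmax):
    """Quantize with one shared scale, then map back to floats."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in values]


def per_channel(row, qmax=7):
    """One scale for the whole output channel (row)."""
    return quantize_dequantize(row, qmax)


def per_block(row, block_size, qmax=7):
    """One scale per contiguous block of `block_size` values."""
    restored = []
    for start in range(0, len(row), block_size):
        restored.extend(quantize_dequantize(row[start:start + block_size], qmax))
    return restored


def mean_abs_error(original, restored):
    return sum(abs(a - b) for a, b in zip(original, restored)) / len(original)


# Hypothetical row: four small weights followed by four large ones.
row = [0.02, -0.03, 0.01, 0.04, 0.9, -0.7, 0.5, -0.8]

channel_error = mean_abs_error(row, per_channel(row))
block_error = mean_abs_error(row, per_block(row, block_size=4))
```

With a single scale, the large weights dictate the step size and the small ones all round to zero. With per-block scales, the first block gets its own finer step, so the error drops, at the cost of storing two scales instead of one for this row.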

Performance Benchmarks:#

Methodology:#

The latency numbers were captured using the Xcode Performance tab and report the median statistic. The compute unit selection is "all" unless otherwise noted. Latency is sensitive to the device state and may vary with the device state and build versions.

  • Device: iPhone 14 Pro (A16), unless otherwise mentioned

  • iOS build: iOS 17

  • Xcode: Xcode 15
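The numbers in this section come from Xcode, which performs the measurement itself. The sketch below only illustrates, in plain Python, why a median is a sensible summary for repeated latency measurements: it is far less affected by an occasional slow run than the mean. The helper names and the recorded samples are hypothetical.

```python
import statistics
import time


def measure_latencies_ms(fn, runs=20, warmup=3):
    """Time `fn` repeatedly and return per-run latencies in milliseconds."""
    for _ in range(warmup):
        fn()  # warm-up runs are not recorded
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return samples


def summarize(samples_ms):
    """Report both statistics so the effect of outliers is visible."""
    return {
        "median": statistics.median(samples_ms),
        "mean": statistics.mean(samples_ms),
    }


# Hypothetical recorded latencies with one slow outlier run.
recorded = [1.0, 1.1, 0.9, 1.0, 10.0]
summary = summarize(recorded)
```

For these samples the single slow run pulls the mean well above the typical latency, while the median stays at the typical value.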

For more details on base models and compression methodology, please refer to docs here.

Model Info#

| Model Name | Task | Pre-trained Weights | Dataset | Accuracy Metric |
|---|---|---|---|---|
| MobileNetv2-1.0 | Image Classification | Torchvision | ImageNet | Top-1 Accuracy (%) |
| ResNet50 | Image Classification | Torchvision | ImageNet | Top-1 Accuracy (%) |
| MobileViTv2-1.0 | Image Classification | cvnets | ImageNet | Top-1 Accuracy (%) |

Results#

| Model Name | Config | Optimization Workflow | Compression Ratio | Top-1 Accuracy (%) | Latency in ms (per batch) on iPhone 14 Pro | Latency in ms (per batch) on iPhone 15 Pro |
|---|---|---|---|---|---|---|
| MobileNetv2-1.0 | Float16 | n/a | 1.0 | 71.86 | 0.48 | 0.49 |
| MobileNetv2-1.0 | Weight-only | Post Training | 1.92 | 71.78 | 0.45 | 0.44 |
| MobileNetv2-1.0 | Weight & activation | Training Time | 1.92 | 71.66 ± 0.04 | 0.27 | 0.20 |
| ResNet50 | Float16 | n/a | 1.0 | 76.14 | 1.52 | 1.38 |
| ResNet50 | Weight-only | Post Training | 1.99 | 76.10 | 1.49 | 1.50 |
| ResNet50 | Weight & activation | Training Time | 1.98 | 76.80 ± 0.05 | 0.94 | 0.77 |
| MobileViTv2-1.0 | Float16 | n/a | 1.0 | 78.09 | 1.38 | 1.36 |
| MobileViTv2-1.0 | Weight-only | Post Training | 1.92 | 77.66 | 1.43 | 1.37 |
| MobileViTv2-1.0 | Weight & activation | Training Time | 1.89 | 76.89 ± 0.07 | 1.18 | 1.03 |

Note: The trained and compressed models and the coremltools.optimize.torch config files used for compression can be downloaded by clicking the respective links embedded in the model and config names.
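As a reading aid for the results above, the following short Python sketch derives speedup factors relative to the Float16 baseline from the reported latencies. The latency values are copied from the Results table; the dictionary layout and the function name are illustrative choices.

```python
# Latencies in ms (per batch) from the Results table:
# (iPhone 14 Pro, iPhone 15 Pro) for each model and config.
LATENCY_MS = {
    "MobileNetv2-1.0": {
        "Float16": (0.48, 0.49),
        "Weight-only": (0.45, 0.44),
        "Weight & activation": (0.27, 0.20),
    },
    "ResNet50": {
        "Float16": (1.52, 1.38),
        "Weight-only": (1.49, 1.50),
        "Weight & activation": (0.94, 0.77),
    },
    "MobileViTv2-1.0": {
        "Float16": (1.38, 1.36),
        "Weight-only": (1.43, 1.37),
        "Weight & activation": (1.18, 1.03),
    },
}
DEVICES = ("iPhone 14 Pro", "iPhone 15 Pro")


def speedups(latency_ms, config):
    """Float16 latency divided by the latency of `config`, per device."""
    result = {}
    for model, rows in latency_ms.items():
        baseline = rows["Float16"]
        quantized = rows[config]
        result[model] = tuple(
            round(b / q, 2) for b, q in zip(baseline, quantized)
        )
    return result


w8a8 = speedups(LATENCY_MS, "Weight & activation")
for model, values in w8a8.items():
    for device, value in zip(DEVICES, values):
        print(f"{model} on {device}: {value:.2f}x vs Float16")
```

With the table's numbers, the weight-and-activation speedup is larger on the iPhone 15 Pro than on the iPhone 14 Pro for all three models, in line with the higher int8-int8 throughput of the newer Neural Engine.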