Performance#

Since quantization reduces the size of each weight value, the amount of data to be moved is reduced during prediction. This can lead to benefits with memory-bottlenecked models.

Quantizing the activations may further ease this memory pressure and may lead to more gains when compared to weight-only quantization. However, with activation quantization, you may observe a considerable slowdown in inference for the compute units (CPU and sometimes GPU) that employ load-time weight decompression, since activations are not known at load time, and they need to be decompressed at runtime, slowing down the inference. Therefore, it is recommended to use activation quantization only when your model is fully or mostly running on the Neural Engine (NE).

In newer hardware with A17 Pro or M4 chips, such as iPhone 15 Pro, there is increased throughput possible for int8-int8 compute on Neural Engine, compared to previous versions. This means that activation and weight quantization for networks running on Neural Engine can give even more latency gains. This can be seen in the table below (e.g. The ResNet50 model with W8A8 mode runs considerably faster than its W16A16 equivalent).

For the per-block weight quantization option added in iOS18/macOS15, which is especially useful when employing quantization to 4-bits, one can expect to see great runtime memory gains, as well as latency gains depending on the model, when it is running on the GPU. On the other hand, if the model is running on the NE, it is recommended to use the per-channel scales option.

Performance Benchmarks:#

Methodology:#

The latency numbers were captured using the Xcode Performance tab, using the median statistic. Compute unit selection is all unless otherwise noted. The latency numbers are sensitive to the device state, and may vary depending on the device state and build versions.

Device: iPhone 14 Pro (A16), unless otherwise mentioned
iOS build: iOS 17
Xcode : Xcode 15

For more details on base models and compression methodology, please refer to docs here.

Model Info#

Model Name	Task	Pre-trained Weights	Dataset	Accuracy Metric
MobileNetv2-1.0	Image Classification	Torchvision	ImageNet	Top-1 Accuracy (%)
ResNet50	Image Classification	Torchvision	ImageNet	Top-1 Accuracy (%)
MobileViTv2-1.0	Image Classification	cvnets	ImageNet	Top-1 Accuracy (%)

Results#

Model Name	Config	Optimization Workflow	Compression Ratio	Accuracy	Latency in ms (per batch) on iPhone 14 Pro	Latency in ms (per batch) on iPhone 15 Pro
MobileNetv2-1.0	Float16	n/a	1.0	71.86	0.48	0.49
MobileNetv2-1.0	Weight-only	Post Training	1.92	71.78	0.45	0.44
MobileNetv2-1.0	Weight & activation	Training Time	1.92	71.66 ± 0.04	0.27	0.20
ResNet50	Float16	n/a	1.0	76.14	1.52	1.38
ResNet50	Weight-only	Post Training	1.99	76.10	1.49	1.50
ResNet50	Weight & activation	Training Time	1.98	76.80 ± 0.05	0.94	0.77
MobileViTv2-1.0	Float16	n/a	1.0	78.09	1.38	1.36
MobileViTv2-1.0	Weight-only	Post Training	1.92	77.66	1.43	1.37
MobileViTv2-1.0	Weight & activation	Training Time	1.89	76.89 ± 0.07	1.18	1.03

Note: The trained and compressed models and the coremltools.optimize.torch config files used for compression can be downloaded by clicking the respective links embedded in the model and config names.

Performance

Contents

Performance#

Performance Benchmarks:#

Methodology:#

Model Info#

Results#