Algorithms#
There are a few different ways in which a model's weights can be palettized. At the same compression factor, each approach can have a different impact on model accuracy. Below we describe the supported palettization algorithms and some considerations to keep in mind when choosing the approach that works well for your use case.
KMeans#
This is a data-free, post-training palettization algorithm in which weights are clustered using k-means clustering, and the derived centroids form the lookup table (LUT).
Since it only requires the model weights, it is the easiest algorithm to set up and experiment with. For higher-bit palettization, post-training palettization provides a good compression-accuracy trade-off.
However, there is a significant loss in accuracy at lower bit widths. For lower bits, `per_grouped_channel`
granularity can be used to recover some of the lost accuracy.
Supported API(s):
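To make the idea concrete, here is a minimal NumPy sketch of data-free k-means palettization (not the coremltools implementation, and all names are illustrative): the flattened weights are clustered with Lloyd's k-means, the centroids become the LUT, and each weight is replaced by the index of its nearest centroid.

```python
import numpy as np

def kmeans_palettize(weights, n_bits=2, n_iters=25, seed=0):
    """Sketch of data-free palettization: cluster the flattened weights
    with plain k-means; return the LUT (centroids) and per-weight indices."""
    k = 2 ** n_bits
    flat = weights.reshape(-1)
    rng = np.random.default_rng(seed)
    # Initialize centroids from randomly chosen weight values.
    lut = rng.choice(flat, size=k, replace=False)
    for _ in range(n_iters):
        # Assign each weight to its nearest centroid.
        idx = np.abs(flat[:, None] - lut[None, :]).argmin(axis=1)
        # Update each centroid to the mean of its assigned weights.
        for j in range(k):
            if np.any(idx == j):
                lut[j] = flat[idx == j].mean()
    return lut, idx.reshape(weights.shape).astype(np.uint8)

rng = np.random.default_rng(1)
w = rng.normal(size=(8, 8)).astype(np.float32)
lut, idx = kmeans_palettize(w, n_bits=2)
w_palettized = lut[idx]  # decompression is just a table lookup
```

With 2 bits, the LUT holds only 4 distinct values, so every weight in the reconstructed tensor takes one of those 4 values; this is the compression the LUT-plus-indices representation buys.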
Sensitive KMeans#
Sensitive KMeans is a calibration-data-based, post-training palettization algorithm, based on SqueezeLLM: Dense-and-Sparse Quantization. It palettizes weights by running a weighted k-means on the model parameters. The per-element weights, called sensitivity, are computed using an objective function that depends on the Hessian of the model parameters. Since the Hessian is a second-order derivative and computationally expensive to compute, it is approximated by the Fisher information matrix, which is computed from the square of the gradients, easily available given a few calibration input data points and a loss function.
The more sensitive an element, the larger the impact perturbing it (or palettizing it) has on the model's loss function. Thus, weighted k-means moves the cluster centroids closer to the sensitive weight values, allowing them to be represented more precisely. This generally leads to lower degradation in model accuracy, though the benefit depends on the model type and on how accurate the Fisher information approximation is for that specific model. Typically, 128 samples are sufficient for applying this algorithm. In practice, this algorithm works well, better than data-free KMeans, for large transformer-based architectures.
Supported API(s):
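The weighting idea can be sketched in NumPy as follows. This is an illustration, not the coremltools API: the `sensitivity` array stands in for the squared gradients (the Fisher information approximation) that would come from calibration data, and the only change relative to plain k-means is the sensitivity-weighted centroid update.

```python
import numpy as np

def sensitive_kmeans_palettize(weights, sensitivity, n_bits=2, n_iters=25, seed=0):
    """Weighted k-means sketch: centroids are pulled toward weights with
    high sensitivity (approximated in practice by squared gradients)."""
    k = 2 ** n_bits
    flat, s = weights.reshape(-1), sensitivity.reshape(-1)
    rng = np.random.default_rng(seed)
    lut = rng.choice(flat, size=k, replace=False)
    for _ in range(n_iters):
        # Hard assignment is unchanged from plain k-means.
        idx = np.abs(flat[:, None] - lut[None, :]).argmin(axis=1)
        for j in range(k):
            mask = idx == j
            if mask.any():
                # Sensitivity-weighted mean instead of a plain mean, so
                # sensitive weights are represented more precisely.
                lut[j] = np.sum(s[mask] * flat[mask]) / np.sum(s[mask])
    return lut, idx.reshape(weights.shape)

rng = np.random.default_rng(2)
w = rng.normal(size=(16,))
sens = rng.normal(size=w.shape) ** 2  # stand-in for squared gradients
lut, idx = sensitive_kmeans_palettize(w, sens)
```

The only difference from the data-free version is the centroid update: elements with large sensitivity dominate the weighted mean, so the cluster center lands closer to them.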
Differentiable KMeans#
Differentiable KMeans (DKM) is a training-time palettization algorithm. The key idea is that in each training step, a soft k-means cluster assignment of the weight tensors is performed such that each operation in the process is differentiable. This allows gradient updates to take place for the weights while a lookup table (LUT) of centroids and the indices into it are learned. This is achieved by inserting palettization submodules into the model, which simulate palettization during training using the differentiable version of the k-means algorithm. This algorithm provides the best compression-accuracy trade-off of all the algorithms and can be used at very low bit precisions while still retaining good accuracy. However, it is also the most time- and data-intensive. Since the algorithm involves computing the distance and attention matrices, in practice it can also require substantial memory.
Supported API(s):
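The soft assignment at the heart of DKM can be sketched in a few lines of NumPy (an illustration of the idea, not the coremltools implementation): the hard argmin over centroid distances is replaced by a softmax "attention" over those distances, making the forward pass differentiable with respect to both the weights and the LUT.

```python
import numpy as np

def soft_palettize(weights, lut, temperature=0.05):
    """One forward pass of the DKM idea: a softmax over distances to the
    LUT centroids replaces the hard argmin, so gradients can flow to both
    the weights and the centroids during training."""
    flat = weights.reshape(-1)
    # Distance matrix between every weight and every centroid.
    dist = np.abs(flat[:, None] - lut[None, :])
    # Numerically stable softmax over negative scaled distances;
    # each row sums to 1 (the "attention" matrix).
    attn = np.exp(-(dist - dist.min(axis=1, keepdims=True)) / temperature)
    attn /= attn.sum(axis=1, keepdims=True)
    # Soft-palettized weights used in the forward pass during training.
    return (attn @ lut).reshape(weights.shape), attn

lut = np.array([-1.0, 0.0, 1.0, 2.0])
w = np.array([[-0.9, 0.1], [1.1, 1.9]])
w_soft, attn = soft_palettize(w, lut)
```

As the temperature is lowered, the attention rows approach one-hot vectors and the soft output converges to the hard nearest-centroid assignment; the distance and attention matrices built here are also exactly why the algorithm can be memory-hungry, since their size scales with the number of weights times the number of centroids.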
Methodology#
In the tables below, we provide accuracy benchmarks on several models palettized using the coremltools.optimize APIs.
See the Palettization Performance page to learn more about how the benchmarked models were generated.
All evaluations were performed on the final compressed (or uncompressed) Core ML models, using the validation subset of the dataset linked in Model Info. The training-time compressed models were trained for three trials, starting from the same pretrained weights and using a different ordering of the data in each trial. For these models, we report the mean accuracy across the three trials, along with the standard deviation.
Results#
| Model Name | Config | Optimization Algorithm | Compression Ratio | Accuracy |
| --- | --- | --- | --- | --- |
| | Float16 | n/a | 1.0 | 71.86 |
| | | Differentiable KMeans | 5.92 | 68.81 ± 0.04 |
| | | Differentiable KMeans | 3.38 | 70.60 ± 0.08 |
| | 6 bit | KMeans | 2.54 | 70.89 |
| | 8 bit | KMeans | 1.97 | 71.80 |
| | Float16 | n/a | 1.0 | 67.58 |
| | | Differentiable KMeans | 5.82 | 59.82 ± 0.98 |
| | | Differentiable KMeans | 3.47 | 67.23 ± 0.04 |
| | 6 bit | KMeans | 2.6 | 65.46 |
| | 8 bit | KMeans | 1.93 | 67.44 |
| | Float16 | n/a | 1.0 | 76.14 |
| | | Differentiable KMeans | 7.63 | 75.47 ± 0.05 |
| | | Differentiable KMeans | 3.9 | 76.63 ± 0.01 |
| | 6 bit | KMeans | 2.65 | 75.68 |
| | 8 bit | KMeans | 1.99 | 76.05 |
| | Float16 | n/a | 1.0 | 29.0 |
| | | Differentiable KMeans | 7.71 | 25.66 ± 0.03 |
| | | Differentiable KMeans | 3.94 | 28.14 ± 0.11 |
| | 6 bit | KMeans | 2.65 | 28.27 |
| | 8 bit | KMeans | 2.0 | 28.75 |