Training-Time Palettization
Palettization is a mechanism for compressing a model by clustering the model’s float weights into a look-up table (LUT) of centroids and indices.
Palettization is implemented as an extension of PyTorch’s QAT APIs. It works by inserting palettization layers in appropriate places inside a model. The model can then be fine-tuned to learn the new palettized layers’ weights in the form of a LUT and indices.
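For intuition, the following minimal sketch (illustrative only, not part of the coremltools API) shows how a dense weight tensor is recovered from a LUT and an index tensor:

import torch

# A 2-bit palettization yields 2**2 = 4 centroids in the look-up table (LUT).
lut = torch.tensor([-0.5, -0.1, 0.2, 0.7])

# Each weight element stores only a 2-bit index into the LUT.
indices = torch.tensor([[0, 3, 1], [2, 2, 0]])

# The dense weight is recovered by indexing the LUT with the index tensor.
weight = lut[indices]  # tensor([[-0.5, 0.7, -0.1], [0.2, 0.2, -0.5]])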
Palettizer
- class coremltools.optimize.torch.palettization.ModuleDKMPalettizerConfig(n_bits: Optional[int] = None, weight_threshold: int = 2048, milestone: int = 0, cluster_dim: Optional[int] = None, quant_min: int = -128, quant_max: int = 127, dtype: Union[str, dtype] = torch.qint8, cluster_dtype: str = 'f32', quantize_activations: bool = False, partition_size: int = 2000000000, cluster_permute: Optional[tuple] = None, palett_max_mem: float = 1.0, kmeans_max_iter: int = 3, prune_threshold: float = 0.0, kmeans_init: str = 'cpu.kmeans++', kmeans_opt1d_threshold: int = 1024, enforce_zero: bool = False, palett_mode: str = 'dkm', palett_tau: float = 0.0001, palett_epsilon: float = 0.0001, palett_lambda: float = 0.0, add_extra_centroid: bool = False, palett_cluster_tol: float = 0.05)[source]
Configuration class for specifying global and module-level options for the palettization algorithm implemented in DKMPalettizer. The parameters specified in this config control the DKM algorithm, described in DKM: Differentiable K-Means Clustering Layer for Neural Network Compression.

For most use cases, the only parameters you need to specify are n_bits, weight_threshold, and milestone.

Note

Most of the parameters in this class are meant for advanced use cases and for further fine-tuning the DKM algorithm. The default values usually work for a majority of tasks.

Note

Change the following parameters only when you use activation quantization in conjunction with DKM weight palettization: quant_min, quant_max, dtype, and quantize_activations.

- Parameters:
  - n_bits (int) – Number of clusters. The number of clusters used is \(2^{n\_bits}\). Defaults to 4 for linear layers and 2 for all other layers.
  - weight_threshold (int) – A module is only palettized if the number of elements in its weight matrix exceeds weight_threshold. Defaults to 2048.
  - milestone (int) – Step or epoch at which palettization begins. Defaults to 0.
  - cluster_dim (int, optional) – The dimension of each cluster. Defaults to 1.
  - quant_min (int, optional) – The minimum value for each element in the weight clusters if they are quantized. Defaults to -128.
  - quant_max (int, optional) – The maximum value for each element in the weight clusters if they are quantized. Defaults to 127.
  - dtype (torch.dtype, optional) – The dtype to use for quantizing the activations. Only applies when quantize_activations is True. Defaults to torch.qint8.
  - cluster_dtype (str, optional) – The dtype to use for quantizing the clusters. Allowed options are 'i8', 'u8', 'f16', 'bf16', and 'f32'. Defaults to 'f32', i.e., by default, the clusters aren't quantized.
  - quantize_activations (bool, optional) – When True, the activations are quantized. Defaults to False.
  - partition_size (int, optional) – partition_size helps in per-channel palettization. Defaults to 2000000000.
  - cluster_permute (tuple, optional) – Permutation order to apply to weight partitions. Defaults to None.
  - palett_max_mem (float, optional) – Proportion of available GPU memory that should be used for palettization. Defaults to 1.0.
  - kmeans_max_iter (int, optional) – Maximum number of differentiable k-means iterations. Defaults to 3.
  - prune_threshold (float, optional) – Hard-shrinks weights between [-prune_threshold, prune_threshold] to zero. Useful for joint pruning and palettization. Defaults to 0.0.
  - kmeans_init (str, optional) – The k-means algorithm to use. Allowed options are efficient_kmeans, cpu.kmeans++, and kmeans_pp. Defaults to cpu.kmeans++.
  - kmeans_opt1d_threshold (int, optional) – Channel threshold used to decide whether opt1d k-means should be used. Defaults to 1024.
  - enforce_zero (bool, optional) – If True, enforces that the LUT centroid closest to the origin is fixed to zero. Defaults to False.
  - palett_mode (str, optional) – Criteria used to calculate attention during k-means. Allowed options are gsm, dkm, and hard. Defaults to dkm.
  - palett_tau (float, optional) – Temperature factor for the softmax used in the DKM algorithm. Defaults to 0.0001.
  - palett_epsilon (float, optional) – Distance threshold for clusters between k-means iterations. Defaults to 0.0001.
  - palett_lambda (float, optional) – Reduces the effect of outliers during centroid calculation. Defaults to 0.0.
  - add_extra_centroid (bool, optional) – If True, adds an extra centroid to the LUT. Defaults to False.
  - palett_cluster_tol (float, optional) – Tolerance for non-unique centroids in the LUT. The higher the number, the more tolerance for non-unique centroids. Defaults to 0.05.
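For instance, a minimal sketch using only the three commonly needed parameters (the values shown are arbitrary):

from coremltools.optimize.torch.palettization import ModuleDKMPalettizerConfig

# 4-bit palettization, applied only to modules whose weight matrices have
# more than 1024 elements, starting at step/epoch 1.
config = ModuleDKMPalettizerConfig(
    n_bits=4,
    weight_threshold=1024,
    milestone=1,
)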
- as_dict() → Dict[str, Any]
Returns the config as a dictionary.
- classmethod from_dict(config_dict: Dict[str, Any]) → ModuleOptimizationConfig
Create class from a dictionary of string keys and values.
- Parameters:
  - config_dict (dict of str and values) – A nested dictionary of strings and values.
- classmethod from_yaml(yml: Union[IO, str]) → ModuleOptimizationConfig
Create class from a YAML stream.
- Parameters:
  - yml – An IO stream containing YAML, or a str path to a YAML file.
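As a sketch, assuming the YAML keys mirror the constructor arguments above, a config can be loaded from an in-memory stream (a str path to a .yaml file works the same way):

import io

from coremltools.optimize.torch.palettization import ModuleDKMPalettizerConfig

# YAML keys mirror the constructor arguments of ModuleDKMPalettizerConfig.
yaml_stream = io.StringIO(
    "n_bits: 4\n"
    "weight_threshold: 1024\n"
    "milestone: 1\n"
)
config = ModuleDKMPalettizerConfig.from_yaml(yaml_stream)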
- class coremltools.optimize.torch.palettization.DKMPalettizerConfig(global_config: Optional[GlobalConfigType] = None, module_type_configs: ModuleTypeConfigType = _Nothing.NOTHING, module_name_configs: ModuleNameConfigType = _Nothing.NOTHING)[source]
Configuration for specifying how different submodules of a model are palettized by DKMPalettizer.

The module_type_configs parameter can accept a list of ModuleDKMPalettizerConfig as values for a given module type. The list can specify different parameters for different weight_threshold values. This is useful if you want to apply different configs to layers of the same type with weights of different sizes.

For example, to use 4-bit palettization for weights with more than 1000 elements and 2-bit palettization for weights with more than 300 but less than 1000 elements, create a config as follows:

custom_config = {
    nn.Conv2d: [
        {"n_bits": 4, "cluster_dim": 4, "weight_threshold": 1000},
        {"n_bits": 2, "cluster_dim": 2, "weight_threshold": 300},
    ]
}
config = DKMPalettizerConfig.from_dict({"module_type_configs": custom_config})
- Parameters:
  - global_config (ModuleDKMPalettizerConfig) – Config to be applied globally to all supported modules. Missing values are chosen from the default config.
  - module_type_configs (dict of str to ModuleDKMPalettizerConfig) – Module-type-level configs applied to a specific module class, such as torch.nn.Linear. The keys can be either strings or module classes. When module_type_config is set to None for a module type, it is not palettized.
  - module_name_configs (dict of str to ModuleDKMPalettizerConfig) – Module-level configs applied to specific modules. The name of the module must be a fully qualified name that can be used to fetch it from the top-level module using the module.get_submodule(target) method. When module_name_config is set to None for a module, it is not palettized.
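As a brief sketch, a single module can be targeted by its fully qualified name ("encoder.conv1" below is a hypothetical name):

from coremltools.optimize.torch.palettization import (
    DKMPalettizerConfig,
    ModuleDKMPalettizerConfig,
)

# Palettize only the module named "encoder.conv1" (hypothetical name);
# modules without a config are left unpalettized.
config = DKMPalettizerConfig(
    module_name_configs={"encoder.conv1": ModuleDKMPalettizerConfig(n_bits=2)}
)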
- as_dict() → Dict[str, Any]
Returns the config as a dictionary.
- classmethod from_dict(config_dict: Dict[str, Any]) → DKMPalettizerConfig [source]
Create class from a dictionary of string keys and values.
- Parameters:
  - config_dict (dict of str and values) – A nested dictionary of strings and values.
- classmethod from_yaml(yml: Union[IO, str]) → OptimizationConfig
Create class from a YAML stream.
- Parameters:
  - yml – An IO stream containing YAML, or a str path to a YAML file.
- set_global(global_config: Optional[ModuleOptimizationConfig]) → OptimizationConfig
Set the global config.
- set_module_name(module_name: str, opt_config: Optional[ModuleOptimizationConfig]) → OptimizationConfig
Set the module level optimization config for a given module instance. If the module level optimization config for an existing module was already set, the new config will override the old one.
- set_module_type(object_type: Union[Callable, str], opt_config: Optional[ModuleOptimizationConfig]) → OptimizationConfig
Set the module level optimization config for a given module type. If the module level optimization config for an existing module type was already set, the new config will override the old one.
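The setters above can be combined to build a config incrementally. A sketch (the module name "lm_head" is hypothetical, and the returned config is reassigned in case the methods do not mutate in place):

import torch.nn as nn

from coremltools.optimize.torch.palettization import (
    DKMPalettizerConfig,
    ModuleDKMPalettizerConfig,
)

config = DKMPalettizerConfig()
# Apply 4-bit palettization to all supported modules ...
config = config.set_global(ModuleDKMPalettizerConfig(n_bits=4))
# ... use 2 bits for every Linear layer ...
config = config.set_module_type(nn.Linear, ModuleDKMPalettizerConfig(n_bits=2))
# ... and skip the module named "lm_head" entirely.
config = config.set_module_name("lm_head", None)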
- class coremltools.optimize.torch.palettization.DKMPalettizer(model: Module, config: Optional[DKMPalettizerConfig] = None)[source]
A palettization algorithm based on "DKM: Differentiable K-Means Clustering Layer for Neural Network Compression". It clusters the weights using a differentiable version of k-means, allowing the look-up table (LUT) and indices of palettized weights to be learned using a gradient-based optimization algorithm such as SGD.

Example

import torch

from coremltools.optimize.torch.palettization import (
    DKMPalettizer,
    DKMPalettizerConfig,
    ModuleDKMPalettizerConfig,
)

# code that defines the pytorch model, loss and optimizer.
model, loss_fn, optimizer = create_model_loss_and_optimizer()

# initialize the palettizer
config = DKMPalettizerConfig(global_config=ModuleDKMPalettizerConfig(n_bits=4))
palettizer = DKMPalettizer(model, config)

# prepare the model to insert FakePalettize layers for palettization
model = palettizer.prepare(inplace=True)

# use palettizer in your PyTorch training loop
for inputs, labels in data:
    optimizer.zero_grad()
    output = model(inputs)
    loss = loss_fn(output, labels)
    loss.backward()
    optimizer.step()
    palettizer.step()

# fold LUT and indices into weights
model = palettizer.finalize(inplace=True)
- Parameters:
  - model (torch.nn.Module) – Model on which the palettizer will act.
  - config (DKMPalettizerConfig) – Config that specifies how different submodules in the model will be configured for palettization. The default config is used when passed as None.
- finalize(model: Optional[Module] = None, inplace: bool = False) → Module [source]
Removes FakePalettize layers from a model and creates new model weights from the LUT and indices buffers.

This function is called to prepare a palettized model for export using coremltools.
- Parameters:
  - model (nn.Module) – Model to finalize.
  - inplace (bool) – If True, model transformations are carried out in place and the original module is mutated; otherwise, a copy of the model is mutated and returned.
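For example, a sketch of the export step after training (the input shape is a placeholder, and the deployment target shown is an assumption; choose the target appropriate for your app):

import coremltools as ct
import torch

# Fold the learned LUT and indices back into regular weight tensors.
finalized_model = palettizer.finalize()
finalized_model.eval()

# Trace and convert; replace the shape with your model's actual input shape.
example_input = torch.rand(1, 3, 224, 224)
traced_model = torch.jit.trace(finalized_model, example_input)
coreml_model = ct.convert(
    traced_model,
    inputs=[ct.TensorType(shape=example_input.shape)],
    minimum_deployment_target=ct.target.iOS16,
)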
- prepare(inplace: bool = False) → Module [source]
Prepares a model for palettization-aware training by inserting FakePalettize layers in appropriate places as specified by the config.
- Parameters:
  - inplace (bool) – If True, model transformations are carried out in place and the original module is mutated; otherwise, a copy of the model is mutated and returned.
- report() → _Report [source]
Returns a dictionary with important statistics related to the current state of palettization. Each key in the dictionary corresponds to a module name, and the value is a dictionary containing statistics such as the number of clusters, the cluster dimension, and the number of parameters.
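A short sketch of inspecting the report during training, assuming the returned _Report supports dict-style iteration as the description implies:

# Print per-module palettization statistics.
report = palettizer.report()
for module_name, stats in report.items():
    print(module_name, stats)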
Palettization Layers
- class coremltools.optimize.torch.palettization.FakePalettize(observer: ObserverBase, n_bits: int, cluster_dim: int, quant_min: int = -128, quant_max: int = 127, cluster_dtype: str = 'f32', advanced_options: dict = {}, **observer_kwargs)[source]
A class that implements the palettization algorithm described in DKM: Differentiable K-Means Clustering Layer for Neural Network Compression. It clusters the weights using a differentiable version of k-means, allowing the look-up table (LUT) and indices of palettized weights to be learned using a gradient-based optimization algorithm such as SGD.

Extends torch.quantization.FakeQuantize to add support for palettization.

Example

from collections import OrderedDict

import torch
import torch.nn as nn

import coremltools.optimize.torch.palettization as palett

model = nn.Sequential(
    OrderedDict(
        [
            ("linear1", nn.Linear(4, 5)),
            ("sigmoid1", nn.Sigmoid()),
            ("linear2", nn.Linear(5, 4)),
            ("sigmoid2", nn.Sigmoid()),
        ]
    )
)

fq_activation = nn.Identity
fq_weight = palett.FakePalettize.with_args(
    observer=torch.quantization.MovingAveragePerChannelMinMaxObserver.with_args(
        quant_min=-128, quant_max=127, dtype=torch.qint8
    ),
    n_bits=2,
    cluster_dim=1,
)
model.linear2.qconfig = torch.quantization.QConfig(
    activation=fq_activation, weight=fq_weight
)

palettized_model = palett.prepare_palettizer(model)

train_model(palettized_model)

palettized_converted_model = palett.finalize(palettized_model)
- Parameters:
observer (
torch.ao.quantization.observer.ObserverBase
) – Observer for quantizing theLUT
.n_bits (
int
) – Number of palettization bits. There would be \(2^{n\_bits}\) unique weights in theLUT
.cluster_dim (
int
) – Dimensionality of centroids to use for clustering.quant_min (
int
) – The minimum allowable quantized value.quant_max (
int
) – The maximum allowable quantized value.cluster_dtype (
str
) – String that decides whether to quantize theLUT
or not. The following are thestr
LUT quantization combinations: (u8
,uint8
), (i8
,int8
), and (f16
,float16
).advanced_options (
dict
) – Advanced options to configure the palettization algorithm.observer_kwargs (optional) – Arguments for the observer module.
Note

Allowed keys for advanced_options are the parameters listed as optional in ModuleDKMPalettizerConfig, besides the ones already covered by other parameters in this class.