Quantization
- coremltools.optimize.coreml.linear_quantize_weights(*args, **kwargs)[source]
Utility function to convert a float precision MLModel of type mlprogram, which uses float-precision weights, into a compressed MLModel that uses n-bit weights (currently only n=4 and n=8 are supported). This is achieved by converting the float weight values stored in the const op into the constexpr_affine_dequantize or constexpr_blockwise_shift_scale op (based on the model's minimum deployment target).
This function uses linear quantization on the float weights, providing up to 4x savings in storage compared to float 16 (for 4-bit weights), or up to 4x savings compared to float 32 (for 8-bit weights). All computation at runtime uses float precision; the precision of the intermediate tensors and the compute precision of the ops are not altered.
For each weight, this utility function converts the weight into the int4/8 or uint4/8 type using either linear interpolation ("linear" mode) or linear symmetric interpolation ("linear_symmetric" mode, the default).

Linear interpolation
The following description uses 8-bit quantization as an illustration; 4-bit works analogously.
Linear interpolation ("linear" mode) maps the min/max of the float range to the 8-bit integer range [low, high] using a zero point (also called quantization bias, or offset) and a scale factor. Int8 quantization uses the range [low, high] = [-128, 127], while uint8 quantization uses the range [0, 255].

"linear" mode uses the quantization formula:

\[w_r = s * (w_q - z)\]

Where:
\(w_r\) and \(s\) are of type float.
\(w_r\) represents the float precision weight.
\(s\) represents the scale.
\(w_q\) and \(z\) are of type 8-bit integer.
\(w_q\) represents the quantized weight.
\(z\) represents the zero point.
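To make the formula concrete, here is a plain NumPy sketch of the dequantization step. The scale, zero point, and weight values are made up for illustration; this is not the coremltools implementation:

```python
import numpy as np

# Hypothetical scale and zero point (not from any real model).
s = np.float32(0.05)  # scale
z = np.int8(3)        # zero point

# Quantized 8-bit weights w_q.
w_q = np.array([-128, -1, 3, 127], dtype=np.int8)

# w_r = s * (w_q - z), computed in float precision.
w_r = s * (w_q.astype(np.float32) - np.float32(z))
print(w_r)  # approximately [-6.55, -0.2, 0.0, 6.2]
```

Note that the subtraction is done after casting to float, so the int8 range is never exceeded during dequantization.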
Quantized weights are computed as follows:
\[w_q = cast\_to\_8\_bit\_integer(w_r / s + cast\_to\_float(z))\]

Note: \(cast\_to\_8\_bit\_integer\) is the process of clipping the input to the range [low, high], followed by rounding and casting to 8-bit integer.

In "linear" mode, s and z are computed by mapping the original float range [A, B] into the 8-bit integer range [-128, 127] or [0, 255]. That is, you are solving the following linear equations:

B = s * (high - z)
A = s * (low - z)

The equations result in the following:

s = (B - A) / (high - low)
z = cast_to_8_bit_integer((low * B - high * A) / (B - A))
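The two closed-form expressions above can be sketched in NumPy. The float range below is made up for illustration; this is not the library's internal code:

```python
import numpy as np

def linear_quant_params(A, B, low=-128, high=127):
    """Solve B = s*(high - z) and A = s*(low - z) for scale s and zero point z."""
    s = (B - A) / (high - low)
    z = int(np.clip(round((low * B - high * A) / (B - A)), low, high))
    return s, z

# Hypothetical asymmetric float range [A, B] = [-1.0, 3.0].
s, z = linear_quant_params(-1.0, 3.0)
print(s, z)  # 4/255 ~= 0.0157, and z = -64

# Round trip: the integer endpoints map back close to B and A
# (up to the rounding applied to z).
print(s * (127 - z), s * (-128 - z))  # ~2.996, ~-1.004
```

The small endpoint error (~0.004) comes from rounding z to an integer, which is inherent to affine quantization.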
When the rank of the weight w is 1, s and z are both scalars. When the rank of the weight is greater than 1, s and z are both vectors. In that case, scales are computed per channel, where the channel is the output dimension, which corresponds to the first dimension for ops such as conv and linear, and the second dimension for the conv_transpose op.

For "linear" mode, \(A = min(w_r)\), \(B = max(w_r)\).

Linear symmetric interpolation
With linear symmetric interpolation ("linear_symmetric" mode, the default), rather than mapping the exact min/max of the float range to the quantized range, the function chooses the maximum absolute value between the min/max, which results in a floating-point range that is symmetric with respect to zero. This also makes the resulting zero point 0 for int8 weights and 127 for uint8 weights.

For "linear_symmetric" mode: \(A = -R\) and \(B = R\), where \(R = max(abs(w_r))\).

This function maps to the range [-127, 127] for int8 weights and [0, 254] for uint8 weights.

The result is s = (B - A) / 254 -> s = 2R / 254 -> s = R / 127.

Solving for z:
int8: z = (-127 * R + 127 * R) / 2R -> z = 0.
uint8: z = (0 * R + 254 * R) / 2R -> z = 127.
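The derivation can be checked numerically. This NumPy sketch (illustrative only, with made-up weights, not the coremltools internals) confirms that the symmetric choice of A and B forces the zero point to 0 for int8 and 127 for uint8:

```python
import numpy as np

def symmetric_quant_params(w_r, low, high):
    """Linear-symmetric parameters: A = -R, B = R with R = max(abs(w_r))."""
    R = np.max(np.abs(w_r))
    A, B = -R, R
    s = (B - A) / (high - low)
    z = int(round((low * B - high * A) / (B - A)))
    return s, z

w = np.array([-0.8, 0.1, 0.5], dtype=np.float32)  # made-up weights, R = 0.8

s_i8, z_i8 = symmetric_quant_params(w, low=-127, high=127)
s_u8, z_u8 = symmetric_quant_params(w, low=0, high=254)
print(z_i8, z_u8)                 # 0 127
print(np.isclose(s_i8, 0.8 / 127))  # True: s = R / 127 for both ranges
```

Because z is exactly 0 for int8, the runtime dequantization reduces to a single multiply, which is one reason "linear_symmetric" is the default.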
- Parameters:
- mlmodel: MLModel
Model to be quantized. This MLModel should be of type mlprogram.
- config: OptimizationConfig
An OptimizationConfig object that specifies the parameters for weight quantization.
- joint_compression: bool
When set, the input mlmodel (which should already be compressed) is further quantized into a jointly compressed mlmodel. For details on which compression schemes can be further jointly quantized, see the blockwise_quantize_weights graph pass.
Using "palettize + quantize" as an example, where the input mlmodel is already palettized: the palettization's lut is further quantized. The weight values are then represented by constexpr_blockwise_shift_scale + constexpr_lut_to_dense ops: lut(int8) -> constexpr_blockwise_shift_scale -> lut(fp16) -> constexpr_lut_to_dense -> dense(fp16)
- Returns:
- model: MLModel
The quantized MLModel instance.
Examples
```python
import coremltools as ct
import coremltools.optimize as cto

model = ct.models.MLModel("my_model.mlpackage")
config = cto.coreml.OptimizationConfig(
    global_config=cto.coreml.OpLinearQuantizerConfig(mode="linear_symmetric")
)
compressed_model = cto.coreml.linear_quantize_weights(model, config)
```
- coremltools.optimize.coreml.experimental.linear_quantize_activations(mlmodel: MLModel, config: OptimizationConfig, sample_data: List)[source]
Utility function to convert a float precision MLModel of type mlprogram, which uses float-precision activations, into a compressed MLModel that uses n-bit activations (currently only n=8 is supported).
This is achieved by calibrating the float activation values observed while feeding real sample data through the model, converting the calibrated statistics into quantize and dequantize op pairs, and inserting those pairs where activations get quantized.
It is recommended to use this together with linear_quantize_weights for 8-bit activation and 8-bit weight linear quantization. It is also compatible with other weight compression methods.
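To make the calibration step concrete, here is a simplified NumPy sketch of min/max calibration followed by a quantize/dequantize round trip. This is an illustration of the general technique, not the graph pass coremltools actually runs:

```python
import numpy as np

def calibrate_activation_range(activation_batches):
    """Track the running min/max of an activation over calibration samples."""
    lo, hi = np.inf, -np.inf
    for act in activation_batches:
        lo = min(lo, float(np.min(act)))
        hi = max(hi, float(np.max(act)))
    return lo, hi

def quantize_dequantize(x, lo, hi, low=-128, high=127):
    """What an inserted quantize/dequantize pair does, using calibrated stats."""
    s = (hi - lo) / (high - low)
    z = int(round((low * hi - high * lo) / (hi - lo)))
    x_q = np.clip(np.round(x / s + z), low, high)
    return s * (x_q - z)

samples = [np.array([0.0, 1.0, 2.0]), np.array([0.5, 3.0])]  # made-up activations
lo, hi = calibrate_activation_range(samples)
print(lo, hi)  # 0.0 3.0
print(quantize_dequantize(np.array([0.0, 1.5, 3.0]), lo, hi))  # close to the inputs
```

Activations outside the calibrated range are clipped, which is why representative sample data matters.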
- Parameters:
- mlmodel: MLModel
Model to be quantized. This MLModel should be of type mlprogram.
- config: OptimizationConfig
An OptimizationConfig object that specifies the parameters for activation quantization.
- sample_data: List
Data used to characterize the statistics of the activation values of the original float precision model. Expects a list of sample input dictionaries.
- Returns:
- model: MLModel
The activation quantized MLModel instance.
Examples
```python
import coremltools as ct
import coremltools.optimize as cto

model = ct.models.MLModel("my_model.mlpackage")
activation_config = cto.coreml.OptimizationConfig(
    global_config=cto.coreml.experimental.OpActivationLinearQuantizerConfig(
        mode="linear_symmetric"
    )
)
compressed_model_a8 = cto.coreml.experimental.linear_quantize_activations(
    model, activation_config, sample_data
)

# (Optional) It's recommended to use with linear_quantize_weights.
weight_config = cto.coreml.OptimizationConfig(
    global_config=cto.coreml.OpLinearQuantizerConfig(mode="linear_symmetric")
)
compressed_model_w8a8 = cto.coreml.linear_quantize_weights(
    compressed_model_a8, weight_config
)
```
- class coremltools.optimize.coreml.OpLinearQuantizerConfig(mode: str = 'linear_symmetric', dtype: str | type = <class 'coremltools.converters.mil.mil.types.type_int.make_int.<locals>.int'>, granularity: str | ~coremltools.optimize.coreml._config.CompressionGranularity = CompressionGranularity.PER_CHANNEL, block_size: int | ~typing.List[int] | ~typing.Tuple[int, ...] = 32, weight_threshold: int | None = 2048)[source]
- Parameters:
- mode: str
Mode for linear quantization:
"linear_symmetric" (default): Input data are quantized in the range [-R, R], where \(R = max(abs(w_r))\).
"linear": Input data are quantized in the range \([min(w_r), max(w_r)]\).
- dtype: str or np.generic or mil.type
Determines the quantized data type (int8/uint8/int4/uint4).
The allowed values are:
np.int8 (the default)
np.uint8
coremltools.converters.mil.mil.types.int8
coremltools.converters.mil.mil.types.uint8
coremltools.converters.mil.mil.types.int4
coremltools.converters.mil.mil.types.uint4
strings such as "int4", "uint4", etc.
- granularity: str
Granularity for quantization.
"per_tensor"
"per_channel"
(default)"per_block"
- block_size: int or List/Tuple of int
Only effective when granularity is set to "per_block". Determines the size of the block, where all elements in a block share the same scale and zero_point.
If it's an int, the block size on each axis is auto-determined for best performance. More specifically, the block will have size block_size on the input axis and 1 on the output axis, where the input/output axes are auto-picked based on the op type. For example, if the weight has shape [C_out, C_in], the block will have shape [1, block_size]; if the weight has shape [C_out, C_in, KH, KW], the block will have shape [1, block_size, KH, KW].
If it's a tuple of int, it must have the same rank as the weight, and it specifies the block size on each axis. The value 0 means the block size is equal to the dim size at the corresponding axis. If the dim size on any axis is not divisible by the corresponding block size, the op will be skipped.
The tuple input of block_size gives users full control over the blocks. Here are some examples of how different granularities can be achieved.

Given the weight of a 2D conv, which has shape [C_out, C_in, KH, KW]:

| Granularity        | output_channel_block_size | input_channel_block_size | Weight Shape of Each Block |
|--------------------|---------------------------|--------------------------|----------------------------|
| Per Tensor         | 0                         | 0                        | [C_out, C_in, KH, KW]      |
| Per Input Channel  | 0                         | 1                        | [C_out, 1, KH, KW]         |
| Per Output Channel | 1                         | 0                        | [1, C_in, KH, KW]          |
| Per Block          | 1                         | 32                       | [1, 32, KH, KW]            |

Given the weight of a linear layer, which has shape [C_out, C_in]:

| Granularity        | output_channel_block_size | input_channel_block_size | Weight Shape of Each Block |
|--------------------|---------------------------|--------------------------|----------------------------|
| Per Tensor         | 0                         | 0                        | [C_out, C_in]              |
| Per Input Channel  | 0                         | 1                        | [C_out, 1]                 |
| Per Output Channel | 1                         | 0                        | [1, C_in]                  |
| Per Block          | 1                         | 32                       | [1, 32]                    |

Given the weight of matmul's y (transpose_y=False), which has shape [..., C_in, C_out]:

| Granularity        | output_channel_block_size | input_channel_block_size | Weight Shape of Each Block |
|--------------------|---------------------------|--------------------------|----------------------------|
| Per Tensor         | 0                         | 0                        | [..., C_in, C_out]         |
| Per Input Channel  | 0                         | 1                        | [..., 1, C_out]            |
| Per Output Channel | 1                         | 0                        | [..., C_in, 1]             |
| Per Block          | 1                         | 32                       | [..., 32, 1]               |
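The relationship between a per-axis block_size tuple and the resulting number of scales can be sketched in plain Python. This is a hypothetical helper for illustration, not a coremltools API:

```python
def block_scale_shape(weight_shape, block_size):
    """Shape of the scale/zero_point tensors for a per-axis block_size tuple.

    A block_size of 0 on an axis means the block spans that whole axis.
    """
    assert len(weight_shape) == len(block_size), "tuple must match weight rank"
    shape = []
    for dim, bs in zip(weight_shape, block_size):
        bs = dim if bs == 0 else bs
        # If a dim is not divisible by its block size, the op would be skipped.
        assert dim % bs == 0, "dim not divisible by block size"
        shape.append(dim // bs)
    return shape

# Linear weight [C_out, C_in] = [64, 128]:
print(block_scale_shape([64, 128], (0, 0)))   # per tensor         -> [1, 1]
print(block_scale_shape([64, 128], (1, 0)))   # per output channel -> [64, 1]
print(block_scale_shape([64, 128], (0, 1)))   # per input channel  -> [1, 128]
print(block_scale_shape([64, 128], (1, 32)))  # per block          -> [64, 4]
```

One scale (and zero point) is stored per block, so finer granularity trades a larger scale tensor for lower quantization error.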
- weight_threshold: int
The size threshold above which weights are quantized. That is, a weight tensor is quantized only if its total number of elements is greater than weight_threshold. Defaults to 2048.
For example, if weight_threshold = 1024 and a weight tensor has shape [10, 20, 1, 1], hence 200 elements, it will not be quantized.