Post-Training Compression
- coremltools.optimize.coreml.linear_quantize_weights(*args, **kwargs)
Utility function to convert a float precision MLModel of type mlprogram, which uses float-precision weights, into a compressed MLModel that uses n-bit weights (currently only n=4 and n=8 are supported). This is achieved by converting the float weight values that are stored in the const op into the constexpr_affine_dequantize or constexpr_blockwise_shift_scale op (based on the model’s minimum deployment target).
This function uses linear quantization on the float weights, providing up to 4x (for 4-bit) savings in storage compared to float 16, or up to 4x (for 8-bit) savings compared to float 32. All computation at runtime uses float precision; the precision of the intermediate tensors and the compute precision of the ops are not altered.
For each weight, this utility function converts the weight into the int4/8 or uint4/8 type using either linear interpolation ("linear" mode) or linear symmetric interpolation ("linear_symmetric" mode, the default).
Linear interpolation
The following description uses 8-bit quantization as an illustration; 4-bit quantization works similarly.
Linear interpolation ("linear" mode) maps the min/max of the float range to the 8-bit integer range [low, high] using a zero point (also called quantization bias, or offset) and a scale factor. For int8 quantization, [low, high] = [-128, 127], while uint8 quantization uses the range [0, 255].
"linear" mode uses the quantization formula:
\[w_r = s * (w_q - z)\]
Where:
\(w_r\) and \(s\) are of type float.
\(w_r\) represents the float precision weight.
\(s\) represents the scale.
\(w_q\) and \(z\) are of type 8-bit integer.
\(w_q\) represents quantized weight.
\(z\) represents the zero point.
Quantized weights are computed as follows:
\[w_q = cast\_to\_8\_bit\_integer(w_r / s + cast\_to\_float(z))\]
Note: \(cast\_to\_8\_bit\_integer\) is the process of clipping the input to the range [low, high], followed by rounding and casting to an 8-bit integer.
In "linear" mode, s and z are computed by mapping the original float range [A, B] into the 8-bit integer range [-128, 127] or [0, 255]. That is, you are solving the following linear equations:
B = s * (high - z)
A = s * (low - z)
The equations result in the following:
s = (B - A) / (high - low)
z = cast_to_8_bit_integer((low * B - high * A) / (B - A))
When the rank of weight w is 1, then s and z are both scalars. When the rank of the weight is greater than 1, then s and z are both vectors. In that case, scales are computed per channel, where the channel is the output dimension: the first dimension for ops such as conv and linear, and the second dimension for the conv_transpose op.
For "linear" mode, \(A = min(w_r)\), \(B = max(w_r)\).
Linear symmetric interpolation
With linear symmetric interpolation ("linear_symmetric" mode, the default), rather than mapping the exact min/max of the float range to the quantized range, the function chooses the maximum absolute value between the min/max, which results in a floating-point range that is symmetric with respect to zero. This also makes the resulting zero point 0 for int8 weight and 127 for uint8 weight.
For "linear_symmetric" mode:
- \(A = -R\) and \(B = R\), where \(R = max(abs(w_r))\).
- This function maps to the range of [-127, 127] for int8 weight and [0, 254] for uint8 weight.
- The result is s=(B-A)/254 -> s=2R/254 -> s=R/127.
- Solving for z:
  - int8: z = (-127 * R + 127 * R)/2R -> z=0.
  - uint8: z = (0 * R + 254 * R)/2R -> z=127.
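The formulas above can be sketched in plain Python. This is only an illustration of the arithmetic, not the library's implementation: the helper names (quantization_params, quantize, dequantize) are hypothetical, each row of the nested list is assumed to be one output channel, and Python's built-in round (round-half-to-even) stands in for whatever rounding the library uses. It also assumes each channel is not constant, so that B - A is non-zero.

```python
def cast_to_8_bit_integer(x, low, high):
    """Clip x to [low, high], then round to the nearest integer."""
    return int(round(min(max(x, low), high)))


def quantization_params(weights, mode, low, high):
    """Return (s, z) for one channel of float weights (hypothetical helper)."""
    if mode == "linear":
        a, b = min(weights), max(weights)
    elif mode == "linear_symmetric":
        r = max(abs(w) for w in weights)
        a, b = -r, r
    else:
        raise ValueError(f"unknown mode: {mode}")
    s = (b - a) / (high - low)
    z = cast_to_8_bit_integer((low * b - high * a) / (b - a), low, high)
    return s, z


def quantize(weights, s, z, low, high):
    """w_q = cast_to_8_bit_integer(w_r / s + cast_to_float(z))"""
    return [cast_to_8_bit_integer(w / s + float(z), low, high) for w in weights]


def dequantize(quantized, s, z):
    """w_r = s * (w_q - z)"""
    return [s * (q - z) for q in quantized]


# Rank-2 weight: one (s, z) pair per output channel (one per row here).
weight = [[-1.0, 0.0, 0.5, 1.0], [-0.2, 0.1, 0.3, 0.4]]
per_channel = [
    quantization_params(row, "linear_symmetric", -127, 127) for row in weight
]
quantized = [
    quantize(row, s, z, -127, 127) for row, (s, z) in zip(weight, per_channel)
]
restored = [dequantize(q, s, z) for q, (s, z) in zip(quantized, per_channel)]
```

With the symmetric int8 range, both channels get a zero point of 0, and every restored value differs from the original by at most half a scale step.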
- Parameters:
- mlmodel: MLModel
Model to be quantized. This MLModel should be of type mlprogram.
- config: OptimizationConfig
An OptimizationConfig object that specifies the parameters for weight quantization.
- joint_compression: bool
When it is set, the input mlmodel (which should already be compressed) is further quantized to a jointly compressed mlmodel. For the compression schemas that can be further jointly quantized, see the blockwise_quantize_weights graph pass for details.
Using “palettize + quantize” as an example, where the input mlmodel is already palettized, the palettization’s LUT is further quantized. The weight values are represented by constexpr_blockwise_shift_scale + constexpr_lut_to_dense ops:
lut(int8) -> constexpr_blockwise_shift_scale -> lut(fp16) -> constexpr_lut_to_dense -> dense(fp16)
- Returns:
- model: MLModel
The quantized MLModel instance.
Examples
import coremltools as ct
import coremltools.optimize as cto

model = ct.models.MLModel("my_model.mlpackage")
config = cto.coreml.OptimizationConfig(
    global_config=cto.coreml.OpLinearQuantizerConfig(mode="linear_symmetric")
)
compressed_model = cto.coreml.linear_quantize_weights(model, config)
- coremltools.optimize.coreml.prune_weights(*args, **kwargs)
Utility function to convert a float precision MLModel of type mlprogram to a compressed MLModel using sparse representation. The const ops storing weight values are replaced by constexpr_sparse_to_dense ops.
This function is useful if the model is trained with pruning techniques so that a lot of weights have zero values. If a large percentage of weight values are zero, a sparse representation is more efficient than a dense one (the default).
The locations of the non-zero weights are stored in a bit mask. If the weight values are {0, 0, 0, 0, 0, 0, 0, 56.3}, the sparse representation contains a bit mask with ones at the locations where the value is non-zero: 00000001b. This is accompanied by the non-zero data, which is a size-1 vector of value {56.3}.
For example, given the following:
weight = [0.3, 0, 0, 0.5, 0, 0]
non_zero_data, bit_mask = sparsify(weight)
The non-zero data and the bit mask are:
non_zero_data = [0.3, 0.5]
bit_mask = "100100"
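The sparsify helper in the example above is pseudocode. A minimal runnable version could look like the following; both function names are illustrative, and the bit mask is kept as a string of "0" and "1" characters purely for readability rather than as packed bits.

```python
def sparsify(weight):
    """Split a flat weight list into its non-zero values and a bit mask."""
    non_zero_data = [w for w in weight if w != 0]
    bit_mask = "".join("1" if w != 0 else "0" for w in weight)
    return non_zero_data, bit_mask


def densify(non_zero_data, bit_mask):
    """Inverse of sparsify: put the non-zero values back where the mask is 1."""
    values = iter(non_zero_data)
    return [next(values) if bit == "1" else 0 for bit in bit_mask]


weight = [0.3, 0, 0, 0.5, 0, 0]
non_zero_data, bit_mask = sparsify(weight)
restored = densify(non_zero_data, bit_mask)
```

The round trip is lossless: densify(*sparsify(weight)) gives back the original list.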
- Parameters:
- mlmodel: MLModel
Model to be sparsified. This MLModel should be of type mlprogram.
- config: OptimizationConfig
An OptimizationConfig object that specifies the parameters for weight pruning.
- joint_compression: bool
When it is set, the input mlmodel (which should already be compressed) is further pruned to a jointly compressed mlmodel. For the compression schemas that can be further jointly pruned, see the prune_weights graph pass for details.
Using “quantize + prune” as an example, where the input mlmodel is already quantized, the model is further pruned. The weight values are represented by constexpr_sparse_blockwise_shift_scale + constexpr_sparse_to_dense ops:
quantized(sparse) -> constexpr_sparse_blockwise_shift_scale -> weight(sparse) -> constexpr_sparse_to_dense -> weight(dense)
- Returns:
- model: MLModel
The sparse MLModel instance.
Examples
import coremltools as ct
import coremltools.optimize as cto

model = ct.models.MLModel("my_model.mlpackage")
config = cto.coreml.OptimizationConfig(
    global_config=cto.coreml.OpThresholdPrunerConfig(threshold=1e-12)
)
compressed_model = cto.coreml.prune_weights(model, config)
- coremltools.optimize.coreml.palettize_weights(*args, **kwargs)
Utility function to convert a float precision MLModel of type mlprogram to a compressed MLModel by reducing the overall number of unique weight values using one or more look-up tables (LUTs). A LUT contains a list of float values. An nbit LUT has 2^{nbits} entries.
For example, a float weight vector such as {0.3, 0.3, 0.5, 0.5} can be compressed using a 1-bit LUT: {0.3, 0.5}. In this case the float vector can be replaced with a 1-bit vector {0, 0, 1, 1}.
This function iterates over all the weights in the mlprogram, discretizes their values, and constructs the LUT according to the algorithm specified in mode. The float values are then converted to the nbit values, and the LUT is saved alongside each weight. The const ops storing weight values are replaced by constexpr_lut_to_dense ops.
At runtime, the LUT and the nbit values are used to reconstruct the float weight values, which are then used to perform the float operation the weight is feeding into.
Consider the following example of "uniform" mode (a linear histogram):
nbits = 2
mode = "uniform"
weight = [0.11, 0.19, 0.3, 0.08, 0.0, 0.02]
The weight can be converted to a palette with indices [0, 1, 2, 3] (2 bits). The indices are a byte array.
The data range [0.0, 0.3] is divided linearly into 4 evenly spaced values, which are [0.0, 0.1, 0.2, 0.3].
The LUT would be [0.0, 0.1, 0.2, 0.3].
The weight is rounded to [0.1, 0.2, 0.3, 0.1, 0.0, 0.0], and represented in the palette as indices [01b, 10b, 11b, 01b, 00b, 00b].
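The "uniform" example can be reproduced with a short sketch that uses the 4-entry (2-bit) palette from the description above. The function names are hypothetical and the code only mirrors the described steps (evenly spaced LUT, nearest-entry assignment); it is not the library's algorithm.

```python
def palettize_uniform(weight, nbits):
    """Build an evenly spaced LUT over [min, max] and assign nearest entries."""
    n_entries = 2 ** nbits
    lo, hi = min(weight), max(weight)
    step = (hi - lo) / (n_entries - 1)
    lut = [lo + i * step for i in range(n_entries)]
    # For each weight, pick the index of the closest LUT entry.
    indices = [
        min(range(n_entries), key=lambda i: abs(w - lut[i])) for w in weight
    ]
    return lut, indices


def reconstruct(lut, indices):
    """What happens at runtime: look each index up in the LUT."""
    return [lut[i] for i in indices]


weight = [0.11, 0.19, 0.3, 0.08, 0.0, 0.02]
lut, indices = palettize_uniform(weight, nbits=2)
rounded = reconstruct(lut, indices)
```

Here indices matches the palette indices listed above (01b, 10b, 11b, 01b, 00b, 00b written as integers), and the LUT entries equal 0.0, 0.1, 0.2, 0.3 up to floating-point rounding.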
- Parameters:
- mlmodel: MLModel
Model to be converted by a LUT. This MLModel should be of type mlprogram.
- config: OptimizationConfig
An OptimizationConfig object that specifies the parameters for weight palettization.
- joint_compression: bool
When it is set, the input mlmodel (which should already be compressed) is further palettized to a jointly compressed mlmodel. For the compression schemas that can be further jointly palettized, see the channelwise_palettize_weights graph pass for details.
Using “prune + palettize” as an example, where the input mlmodel is already pruned, the non-zero entries are further palettized. The weight values are represented by constexpr_lut_to_sparse + constexpr_sparse_to_dense ops:
lut(sparse) -> constexpr_lut_to_sparse -> weight(sparse) -> constexpr_sparse_to_dense -> weight(dense)
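As a rough picture of the “prune + palettize” chain, the sketch below rebuilds a dense weight in the two steps named above. The functions are hypothetical stand-ins for the ops, the bit mask is a string for readability, and the data values are made up for illustration.

```python
def lut_to_sparse(indices, lut):
    """Map the palette indices of the non-zero entries to float values."""
    return [lut[i] for i in indices]


def sparse_to_dense(non_zero_data, bit_mask):
    """Scatter the non-zero values to the positions marked 1 in the mask."""
    values = iter(non_zero_data)
    return [next(values) if bit == "1" else 0.0 for bit in bit_mask]


bit_mask = "100101"    # positions of the non-zero weights
lut = [0.3, 0.5]       # 1-bit palette for the non-zero weights
indices = [0, 1, 1]    # one palette index per non-zero weight

weight_sparse = lut_to_sparse(indices, lut)             # lut(sparse) -> weight(sparse)
weight_dense = sparse_to_dense(weight_sparse, bit_mask)  # weight(sparse) -> weight(dense)
```

Only the mask, the palette, and one small index per non-zero entry are stored, which is why the two techniques combine well.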
- Returns:
- model: MLModel
The palettized MLModel instance.
Examples
import coremltools as ct
import coremltools.optimize as cto

model = ct.models.MLModel("my_model.mlpackage")
config = cto.coreml.OptimizationConfig(
    global_config=cto.coreml.OpPalettizerConfig(mode="kmeans", nbits=4)
)
compressed_model = cto.coreml.palettize_weights(model, config)
- coremltools.optimize.coreml.decompress_weights(*args, **kwargs)
Utility function to convert weights that are sparse, palettized, or affine quantized back to the float format. That is, convert any of the following three ops to mb.const:
- constexpr_affine_dequantize
- constexpr_lut_to_dense
- constexpr_sparse_to_dense
- Parameters:
- mlmodel: MLModel
Model which will be decompressed.
- Returns:
- model: MLModel
The MLModel with no constexpr ops included.
Examples
import coremltools as ct

model = ct.models.MLModel("my_compressed_model.mlpackage")
decompressed_model = ct.optimize.coreml.decompress_weights(model)
- coremltools.optimize.coreml.get_weights_metadata(*args, **kwargs)
Utility function to get the weights metadata as a dictionary, which maps the weight’s name to its corresponding CoreMLWeightMetaData.
CoreMLWeightMetaData contains the following attributes:
- val: The weight data.
- sparsity: The fraction of elements whose absolute value is <= 1e-12.
- unique_values: The number of unique values in the weight.
- child_ops: Meta information of the child ops that the weight feeds into.
- Parameters:
- mlmodel: MLModel
Model from which the weight metadata is retrieved.
- weight_threshold: int
The size threshold, above which weights are returned. That is, a weight tensor is included in the resulting dictionary only if its total number of elements is greater than weight_threshold. For example, if weight_threshold = 1024 and a weight tensor is of shape [10, 20, 1, 1], hence 200 elements, it will not be returned by the get_weights_metadata API.
If not provided, it will be set to 2048, in which case only weights with more than 2048 elements are returned.
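The size check described for weight_threshold amounts to comparing the element count of a tensor with the threshold. A small sketch of that rule, using a hypothetical helper name and a strict "greater than" comparison as stated above:

```python
import math


def exceeds_threshold(shape, weight_threshold=2048):
    """True if a tensor of this shape has more elements than the threshold."""
    return math.prod(shape) > weight_threshold


small = exceeds_threshold([10, 20, 1, 1], weight_threshold=1024)  # 200 elements
large = exceeds_threshold((32, 64, 2, 2))                         # 8192 elements
```

The first tensor is filtered out (200 is not greater than 1024), while the second is kept under the default threshold of 2048.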
- Returns:
- dict[str, CoreMLWeightMetaData]
A dict that maps weight’s name to its metadata.
Examples
In this example, there are two weights whose sizes are greater than 2048. A weight named conv_1_weight is feeding into a conv op named conv_1, while another weight named linear_1_weight is feeding into a linear op named linear_1. You can access the metadata by weight_metadata_dict["conv_1_weight"], and so on.
import coremltools as ct

mlmodel = ct.models.MLModel("my_model.mlpackage")
weight_metadata_dict = ct.optimize.coreml.get_weights_metadata(
    mlmodel, weight_threshold=2048
)

# get the weight names with size >= 25600
large_weights = []
for k, v in weight_metadata_dict.items():
    if v.val.size >= 25600:
        large_weights.append(k)

# get the weight names with sparsity >= 50%
sparse_weights = []
for k, v in weight_metadata_dict.items():
    if v.sparsity >= 0.5:
        sparse_weights.append(k)

# get the weight names with unique elements <= 16
palettized_weights = []
for k, v in weight_metadata_dict.items():
    if v.unique_values <= 16:
        palettized_weights.append(k)

# print out the dictionary
print(weight_metadata_dict)
The output from the above example would be:
conv_1_weight
[
    val: np.ndarray(shape=(32, 64, 2, 2), dtype=float32)
    sparsity: 0.5
    unique_values: 4097
    child_ops: [
        conv(name=conv_1, weight=conv_1_weight, ...)
    ]
]
linear_1_weight
[
    val: np.ndarray(shape=(128, 64), dtype=float32)
    sparsity: 0.2501220703125
    unique_values: 4
    child_ops: [
        linear(name=linear_1, weight=linear_1_weight, ...)
    ]
]
- class coremltools.optimize.coreml.CoreMLWeightMetaData(val: ndarray, sparsity: float | None = NOTHING, unique_values: int | None = NOTHING, child_ops: List[CoreMLOpMetaData] | None = None)
A container class that stores weight meta data.
The class has the following attributes:
- Parameters:
- val: numpy.ndarray
The weight data.
- sparsity: float
The fraction of elements whose absolute value is <= 1e-12.
- unique_values: int
Number of unique values in the weight.
- child_ops: list[CoreMLOpMetaData]
A list of CoreMLOpMetaData that contains information about the child ops that the weight feeds into.
The attributes can be accessed by:
- child_ops[idx].op_type: The operation type of the idx-th child op.
- child_ops[idx].name: The name of the idx-th child op.
Other op-dependent attributes can also be accessed. For instance, if the idx-th child op is a conv layer, child_ops[idx].weight will return its weight name.
For more details, please refer to the CoreMLOpMetaData doc string.
Examples
import numpy as np
from coremltools.optimize.coreml import CoreMLWeightMetaData

data = np.array([[1.0, 0.0], [0.0, 6.0]], dtype=np.float32)
meta_data = CoreMLWeightMetaData(data)
print(meta_data)
Outputs:
[
    val: np.ndarray(shape=(2, 2), dtype=float32)
    sparsity: 0.5
    unique_values: 3
]
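The sparsity and unique_values figures in this output follow directly from the attribute definitions. As a plain-Python illustration (the function name is hypothetical, and nested lists stand in for the NumPy array):

```python
def weight_statistics(values, tol=1e-12):
    """Return (sparsity, unique_values) for a flat list of weight values."""
    flat = list(values)
    # Fraction of elements whose absolute value is <= tol.
    sparsity = sum(1 for v in flat if abs(v) <= tol) / len(flat)
    unique_values = len(set(flat))
    return sparsity, unique_values


data = [[1.0, 0.0], [0.0, 6.0]]
flat = [v for row in data for v in row]
sparsity, unique_values = weight_statistics(flat)
```

Two of the four elements are zero, giving a sparsity of 0.5, and the distinct values are 0.0, 1.0, and 6.0.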
- class coremltools.optimize.coreml.CoreMLOpMetaData(op_type: str, name: str, params_name_mapping: Dict[str, str])
A container class that stores op meta data.
The class has the following attributes:
- Parameters:
- op_type: str
The type of the op. For instance: conv, linear, and so on.
- name: str
The name of the op.
- params_name_mapping: dict[str, str]
A dict that maps the op’s constant parameters to their corresponding weight names. For instance, given a conv op with params_name_mapping,
{
    "weight": "conv_1_weight",
    "bias": "conv_1_bias",
}
means that the weight and bias of this op are named conv_1_weight and conv_1_bias, respectively.
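To make the relationship between params_name_mapping and attribute-style access (such as child_ops[idx].weight) concrete, here is a hypothetical stand-in class. It is not the real CoreMLOpMetaData; how the library resolves these attributes is not described here, and forwarding unknown attributes to the mapping is only one way it could be done.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class OpMetaDataSketch:
    """Hypothetical stand-in for an op metadata container."""

    op_type: str
    name: str
    params_name_mapping: Dict[str, str] = field(default_factory=dict)

    def __getattr__(self, key):
        # Called only when normal attribute lookup fails, so op_type and
        # name are never routed through the mapping.
        mapping = self.__dict__.get("params_name_mapping", {})
        if key in mapping:
            return mapping[key]
        raise AttributeError(key)


op = OpMetaDataSketch(
    op_type="conv",
    name="conv_1",
    params_name_mapping={"weight": "conv_1_weight", "bias": "conv_1_bias"},
)
weight_name = op.weight
bias_name = op.bias
```

With this sketch, op.weight and op.bias return the weight names from the mapping, while op.op_type and op.name behave like ordinary fields.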