coreai_opt.coreai_utils.quantize_weights
- coreai_opt.coreai_utils.quantize_weights(coreai_program, dtype, qscheme=QScheme.SYMMETRIC, granularity=CompressionGranularity.PER_CHANNEL, block_size=32, weight_num_threshold=1024, scale_dtype=None, in_place=False)
Quantize weights in a Core AI AIProgram (MLIR<CoreAI> IR) by using Core AI ops.
Walks through the IR and quantizes each coreai.constant op that needs to be compressed. Only constants consumed by ops in _OPS_WEIGHT_NEED_COMPRESSION are candidates; ops that fail to be quantized are skipped with a warning.

The granularity and block_size parameters determine the effective block_sizes per axis (0 means the full axis is one block):

For a 2-D linear weight [C_out, C_in]:

|-------------------------------|--------------------------|
| Granularity                   | block_sizes              |
|-------------------------------|--------------------------|
| PER_TENSOR                    | [0, 0]                   |
| PER_CHANNEL                   | [1, 0]                   |
| PER_BLOCK(bs=32)              | [1, 32]                  |
|-------------------------------|--------------------------|

For a 4-D Conv weight [C_out, C_in, KH, KW]:

|-------------------------------|--------------------------|
| Granularity                   | block_sizes              |
|-------------------------------|--------------------------|
| PER_TENSOR                    | [0, 0, 0, 0]             |
| PER_CHANNEL                   | [1, 0, 0, 0]             |
| PER_BLOCK(bs=32)              | [1, 32, 0, 0]            |
|-------------------------------|--------------------------|
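
The mapping in the two tables above can be sketched in plain Python. This is an illustration only, not the library's implementation: the `Granularity` enum is a stand-in for CompressionGranularity, and `effective_block_sizes` is a hypothetical helper that assumes axis 0 is C_out and axis 1 is C_in, as in the documented layouts.

```python
from enum import Enum


class Granularity(Enum):
    """Stand-in for CompressionGranularity (not the library's enum)."""
    PER_TENSOR = "per_tensor"
    PER_CHANNEL = "per_channel"
    PER_BLOCK = "per_block"


def effective_block_sizes(rank, granularity, block_size=32):
    """Hypothetical helper that reproduces the documented tables.

    Assumes axis 0 is C_out and axis 1 is C_in.
    """
    if rank < 2:
        raise ValueError("expected a weight with at least 2 axes")
    sizes = [0] * rank  # 0 means the whole axis forms a single block
    if granularity is Granularity.PER_TENSOR:
        return sizes
    sizes[0] = 1  # one set of quantization parameters per output channel
    if granularity is Granularity.PER_BLOCK:
        sizes[1] = block_size  # blocks run along the input channel axis
    return sizes


print(effective_block_sizes(2, Granularity.PER_BLOCK))    # [1, 32]
print(effective_block_sizes(4, Granularity.PER_CHANNEL))  # [1, 0, 0, 0]
```

Note that block_size only influences the result for PER_BLOCK, which matches the parameter description: it is ignored for the other two granularities.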
- Parameters:
coreai_program (AIProgram) – The model to be quantized.
dtype (DType) – Target quantized data type (e.g. DType.INT8, DType.INT4, DType.FP8_E4M3FN, DType.FP4_E2M1FN).
qscheme (QScheme) – Quantization scheme. Use QScheme.SYMMETRIC or QScheme.ASYMMETRIC. FP dtypes only support QScheme.SYMMETRIC. Defaults to QScheme.SYMMETRIC.
granularity (CompressionGranularity) – Quantization granularity. Supports CompressionGranularity.PER_TENSOR, CompressionGranularity.PER_CHANNEL, and CompressionGranularity.PER_BLOCK. Defaults to CompressionGranularity.PER_CHANNEL.
block_size (int) – Block size applied to the input channel axis. Only effective when granularity is CompressionGranularity.PER_BLOCK. Defaults to 32.
weight_num_threshold (int) – Threshold of weight element count to determine whether to compress a weight. Defaults to 1024.
scale_dtype (DType | None) – Data type for the scale constants. Must be None for integer dtype values. Must be None for DType.FP4_E2M1FN (scale is always stored in DType.FP8_E8M0FNU internally). For FP8 dtype values, None (default) uses the uncompressed weight dtype (e.g. f16 or f32) for the scale; DType.FP8_E8M0FNU stores the scale in the 8-bit E8M0FNU format (MXFP). Defaults to None.
in_place (bool) – Whether to quantize the model in-place. Defaults to False.
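
To make the qscheme choice concrete, here is a minimal sketch of the textbook symmetric and asymmetric integer quantization formulas applied to one block of values. It is not taken from the library: `quantize_symmetric` and `quantize_asymmetric` are hypothetical helpers, and the rounding and clamping rules that quantize_weights actually applies are not documented in this section.

```python
def quantize_symmetric(values, qmax=127):
    """Symmetric scheme: real zero maps to integer zero, so no zero point is stored."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    quantized = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return quantized, scale


def quantize_asymmetric(values, qmin=-128, qmax=127):
    """Asymmetric scheme: the [lo, hi] range is shifted onto [qmin, qmax] by a zero point."""
    # Extend the range to include 0.0 so that zero stays exactly representable
    # (a common convention, assumed here).
    lo = min(min(values), 0.0)
    hi = max(max(values), 0.0)
    scale = (hi - lo) / (qmax - qmin) if hi > lo else 1.0
    zero_point = round(qmin - lo / scale)
    quantized = [
        max(qmin, min(qmax, round(v / scale) + zero_point)) for v in values
    ]
    return quantized, scale, zero_point


def dequantize(quantized, scale, zero_point=0):
    """Map integers back to approximate real values."""
    return [(q - zero_point) * scale for q in quantized]


block = [0.0, 1.0, 2.0]
q_sym, s_sym = quantize_symmetric(block)
q_asym, s_asym, zp = quantize_asymmetric(block)
print(dequantize(q_sym, s_sym))
print(dequantize(q_asym, s_asym, zp))
```

For an all-positive block like this one, the asymmetric scheme spends the full integer range on [0, 2], while the symmetric scheme leaves the negative half unused. With PER_CHANNEL or PER_BLOCK granularity, one such scale (and zero point, if any) would be computed per channel or per block rather than per tensor.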
- Returns:
A quantized Core AI program.
- Return type:
AIProgram
- Raises:
ValueError – If dtype is not in the set of supported weight dtypes.
ValueError – If dtype is an FP dtype and qscheme is QScheme.ASYMMETRIC. FP quantization only supports symmetric mode.
ValueError – If scale_dtype is not None for an integer dtype.
ValueError – If scale_dtype is not None for DType.FP4_E2M1FN.
ValueError – If dtype is DType.FP4_E2M1FN and granularity is not CompressionGranularity.PER_BLOCK or block_size is not 32. FP4 weights must use per-block quantization with a block size of 32 to produce a valid MXFP4 encoding.
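
The argument rules listed under Raises can be collected into a small pre-flight check. The sketch below is hypothetical and does not call the library: plain strings stand in for the DType, QScheme, and CompressionGranularity members, `check_quantize_args` is an illustrative name, and the supported set is assumed to be only the four dtypes named in this section (the real set may be larger).

```python
# Assumption: only the dtypes named in this section are treated as supported.
INT_DTYPES = {"INT4", "INT8"}
FP_DTYPES = {"FP8_E4M3FN", "FP4_E2M1FN"}


def check_quantize_args(dtype, qscheme="SYMMETRIC", granularity="PER_CHANNEL",
                        block_size=32, scale_dtype=None):
    """Hypothetical check mirroring the documented ValueError conditions."""
    if dtype not in INT_DTYPES | FP_DTYPES:
        raise ValueError(f"unsupported weight dtype: {dtype}")
    if dtype in FP_DTYPES and qscheme == "ASYMMETRIC":
        raise ValueError("FP quantization only supports symmetric mode")
    if dtype in INT_DTYPES and scale_dtype is not None:
        raise ValueError("scale_dtype must be None for integer dtypes")
    if dtype == "FP4_E2M1FN":
        if scale_dtype is not None:
            raise ValueError("scale_dtype must be None for FP4_E2M1FN")
        if granularity != "PER_BLOCK" or block_size != 32:
            raise ValueError("FP4 requires PER_BLOCK with block_size=32 (MXFP4)")


# Valid combinations pass silently.
check_quantize_args("INT8")
check_quantize_args("FP4_E2M1FN", granularity="PER_BLOCK", block_size=32)
check_quantize_args("FP8_E4M3FN", scale_dtype="FP8_E8M0FNU")

# The default granularity (PER_CHANNEL) is rejected for FP4.
try:
    check_quantize_args("FP4_E2M1FN")
except ValueError as err:
    print(err)
```

One practical consequence of these rules is that FP4 quantization cannot be requested with the function's default arguments: granularity must be set to PER_BLOCK explicitly.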