Joint Compression¶

Combining weight palettization with activation quantization¶

Palettization and Quantization can be applied together to compress both weights and activations in a single model. The workflow uses the same KMeansPalettizer and Quantizer APIs covered in previous sections, applied sequentially in a specific order: palettize weights first, then quantize activations on the palettized model.

When combining palettization with activation quantization, the lookup table (LUT) entries should also be quantized to INT8 via lut_qspec. A floating-point LUT causes operations to execute in floating-point regardless of the activation quantization, whereas an INT8 LUT allows the runtime to use the faster W_INT8-A_INT8 execution path where available.

For quantizing activations, graph-mode quantization is used (the default).

Note: Models compressed via the joint compression flow can currently only be finalized to the Core AI backend.

Joint Compression Workflow¶

Step 1: Palettize weights¶

Configure 4-bit palettization and run prepare to install fake-palettization parametrizations on the model weights. INT8 LUT quantization is specified using the lut_qspec argument in PalettizationSpec.

import torch
import coreai_opt as opt
from coreai_opt.palettization import (
    KMeansPalettizer,
    KMeansPalettizerConfig,
    ModuleKMeansPalettizerConfig,
    PalettizationSpec,
)
from coreai_opt.quantization import QuantizationSpec
from coreai_opt.quantization.spec import QuantizationScheme

lut_qspec = QuantizationSpec(dtype=torch.int8, qscheme=QuantizationScheme.SYMMETRIC)
palett_config = KMeansPalettizerConfig(
    global_config=ModuleKMeansPalettizerConfig(
        op_state_spec={"weight": PalettizationSpec(n_bits=4, lut_qspec=lut_qspec)},
    ),
)

palettizer = KMeansPalettizer(model, palett_config)
palettizer.prepare(example_inputs)

Step 2: Finalize the palettizer¶

Call finalize to replace the FakePalettize parametrizations with a torch.export-compatible representation. This must happen before activation quantization is applied because quantizer.prepare uses torch.export, which cannot trace through the parametrizations.

palettized_model = palettizer.finalize(backend=opt.ExportBackend.CoreAI)

Step 3: Configure and prepare activation quantization¶

Apply the Quantizer to the already-palettized model. Set op_state_spec=None to disable weight quantization — weights are already compressed via palettization, so applying quantization on top would be redundant. Use a representative data sample for example_inputs to provide a reasonable starting point for activation quantization parameters.

from coreai_opt.quantization import (
    ModuleQuantizerConfig,
    Quantizer,
    QuantizerConfig,
)

act_spec = QuantizationSpec(dtype=torch.int8, qscheme=QuantizationScheme.SYMMETRIC)
quant_config = QuantizerConfig(
    global_config=ModuleQuantizerConfig(
        op_state_spec=None,
        op_input_spec={"*": act_spec},
        op_output_spec={"*": act_spec},
    ),
)

quantizer = Quantizer(palettized_model, quant_config)
prepared_model = quantizer.prepare(example_inputs)

Step 4: Calibrate¶

Run representative data through the prepared model inside calibration_mode to collect activation statistics used to compute quantization scales.

with quantizer.calibration_mode():
    for batch in calibration_dataloader:
        prepared_model(batch)

Step 5: Finalize¶

Call quantizer.finalize to convert fake-quantization ops into backend-specific representations. The model is then ready to be exported for downstream conversion with coreai-torch. Refer to Integration with Core AI for more details.

final_model = quantizer.finalize(backend=opt.ExportBackend.CoreAI)

Notes¶

For a working end-to-end example, see test_p4a8_compression_mnist_accuracy in tests/test_joint_compression.py.
For export-related tests, see test_mnist_p4a8_compression_export in tests/export/test_pt2e_mlir_export.py.
We explore applying joint compression to the EDSR model here.