API Overview

Weight-Only Quantization (Data-Free PTQ)

Weight-only quantization compresses the model’s weight tensors to a lower precision while keeping activations in their original precision.

import coreai_opt as opt
from coreai_opt.quantization import Quantizer, QuantizerConfig
import torch

model = MyModel().eval()
example_inputs = (torch.randn(1, 3, 224, 224),)

# A built-in preset: INT8, symmetric, per-channel weight-only quantization.
config = QuantizerConfig.presets.w8()
quantizer = Quantizer(model, config)

# Compress the weights and insert fake-quantize ops.
prepared_model = quantizer.prepare(example_inputs)

# Evaluate — forward passes already reflect the compression effect.
val_metric = validate(prepared_model, val_dataset)

When the accuracy is acceptable, call finalize() for the target backend, then convert.

Core AI (.aimodel):

from pathlib import Path
from coreai_opt.casting import cast_to_16_bit_precision
import coreai_torch

finalized_model = quantizer.finalize(backend=opt.ExportBackend.CoreAI)

# torch.export the finalized model with coreai-torch's decomposition table.
exported_program = torch.export.export(
    finalized_model, example_inputs
).run_decompositions(coreai_torch.get_decomp_table())

# Cast remaining FP32 weights/activations to FP16 (and INT32/INT64 to INT16)
# for faster inference.
cast_to_16_bit_precision(exported_program)

# Convert to a Core AI program and save it as an .aimodel.
converter = coreai_torch.TorchConverter()
converter.add_exported_program(exported_program)
ai_program = converter.to_coreai()
ai_program.optimize()
ai_program.save_asset(Path("model.aimodel"))

Core ML (.mlpackage):

import coremltools as ct

finalized_model = quantizer.finalize(backend=opt.ExportBackend.CoreML)

# Trace and convert, preferably using the highest available deployment target.
traced_model = torch.jit.trace(finalized_model, example_inputs)
mlmodel = ct.convert(
    traced_model,
    convert_to="mlprogram",
    minimum_deployment_target=ct.target.iOS26,
)
mlmodel.save("model.mlpackage")

The workflow has three stages:

  • Configure. QuantizerConfig specifies the quantization scheme; the presets namespace provides ready-made recipes (e.g. presets.w8() for INT8 symmetric per-channel weights). See Quantization Config to customize the spec or apply different settings to different layers.

  • Prepare. prepare() compresses the weights according to the config and inserts fake-quantize ops. The prepared model runs like the final compressed model — its forward pass reflects the compression — so you can evaluate accuracy on it, and iterate by changing the config. For weight-only quantization this is the only step needed before export, unless you want to push accuracy further with training (see the QAT section below).

  • Finalize and export. finalize(backend=...) freezes the quantization parameters and replaces the fake-quantize ops with the chosen backend’s compression ops or metadata. The finalized model is meant primarily for export and stays numerically close to the model just before finalize(). Export to Core AI with coreai-torch (the cast_to_16_bit_precision step is recommended), or to Core ML with coremltools.

Weight + Activation Quantization (Calibration-Based PTQ)

When quantizing both weights and activations, the quantizer needs to observe activation ranges on representative data so that it can compute quantization parameters (scale, etc.) accurately. This is done via calibration_mode().

Note: Unlike weight-only quantization, the example_inputs used to prepare the quantized model with activation quantization enabled should be representative of the data the model would typically see. This is because example_inputs provides a starting point for the activation quantization parameters.

Further calibration is still required; however, using even a single representative data sample as the example input can reduce the number of calibration samples needed later.

from coreai_opt.quantization import Quantizer, QuantizerConfig
import torch

model = MyModel().eval()
# Use a representative data sample when activation quantization is enabled.
example_inputs = (...,)

# Default config: symmetric INT8 weights (per-channel)
# + INT8 activations (per-tensor).
config = QuantizerConfig()
quantizer = Quantizer(model, config)
prepared_model = quantizer.prepare(example_inputs)

# Calibrate activation ranges on a small amount of representative data.
with quantizer.calibration_mode():
    for batch in calibration_dataloader:
        prepared_model(batch)

# Evaluate the calibrated model.
val_metric = validate(prepared_model, val_dataset)

Right after prepare(), the activation scales come only from example_inputs, so accuracy is usually poor at this point. calibration_mode() addresses this. This is what the context manager handles:

  • Inside the context:

    • fake-quantization is turned off: forward pass gives the same output as the unquantized model. Hence without distorting the outputs, the quantization params can be computed.

    • range observers are turned on: this means that each forward pass updates the observed activation ranges, and hence the activation quantization scales.

  • After exiting the context manager, observers are turned back off and fake-quantization back on, leaving the model ready for evaluation.

A small amount of representative data is typically enough.

Finalize and export the calibrated prepared_model exactly as shown in the Weight-Only section above.

Note

The finalized model is numerically close to the model just before finalize(), but not always bit-identical. For models with a Conv + BatchNorm pattern in the default graph execution mode, the two can differ slightly more: BatchNorm folding is handled with ops that are different between the prepared and finalized models (though algebraically equivalent). Weight quantization is matched closely between the prepared and finalized graphs, but activation quantization can still show a small numerical divergence.

Weight and/or Activation Quantization (QAT: Quantization-Aware Training)

If neither data-free weight-only quantization nor calibration-based weight + activation quantization reaches the accuracy you need — typically at the most aggressive settings (4-bit weights and below) or for models sensitive to activation quantization — fine-tune the model with quantization-aware training (QAT).

from coreai_opt.quantization import Quantizer, QuantizerConfig
import torch

model = MyModel().eval()
example_inputs = (...,)  # representative data sample

config = QuantizerConfig()
quantizer = Quantizer(model, config)
prepared_model = quantizer.prepare(example_inputs)

# Fine-tune with quantization simulated in the forward pass.
for epoch in range(num_epochs):
    with quantizer.training_mode():
        train_one_epoch(prepared_model, train_dataloader, grad_optimizer)
    val_loss = validate(prepared_model, val_dataloader)

Inside training_mode(), the model is put in train mode, range observers are enabled, and fake-quantization is applied, so the loss captures the quantization error and the weights and quantization params adapt to it during training. On exit, the model returns to an evaluation-safe state (observers off, fake-quantization on), so the per-epoch validation reflects the compressed model without further changing the quantization scales. Finalize and export the fine-tuned prepared_model as in the Weight-Only section above.

This was a brief overview of the API for the most common usage workflow.

  • See Quantization Config on how to modify the quantization spec (different quantization formats, range observers, etc.) and how to apply different quantization settings to different parts of the model (e.g. skip certain layers, etc.)

  • See Deeper Dive on how to control the QAT schedule at a more granular level (e.g. how to freeze quantization scales during QAT, freeze batchnorm stats, etc.), and how to define custom patterns for quantizer placement, define custom logic for quantization parameter calculation, etc.

Supported quantization types

For a detailed and current list of supported types, refer to the API reference of QuantizationSpec.

Broadly speaking, the following dtypes can be used for quantization:

  • For weights: INT2/UINT2, INT4/UINT4, INT8/UINT8, FP8_E4M3, FP4_E2M1

  • For activations: INT8/UINT8, FP8_E4M3

A few other dtypes (e.g. FP8_E5M2) may also be available to experiment with; however, they may have limited support on the Core AI runtime.

Two Execution Modes: graph and eager

The Quantizer has two internal implementations:

  • Graph mode (default). Built on top of torchao’s PT2E (PyTorch 2.0 Export) implementation. Requires a torch.export-compatible model. The prepared model is an fx.GraphModule. This is recommended when performing weight + activation quantization.

  • Eager mode. Built using the __torch_function__ protocol, with no graph capture. The prepared model is still an nn.Module.

Code examples

# Graph mode: this is the default
from coreai_opt.quantization import Quantizer, QuantizerConfig
from coreai_opt.quantization.config import ExecutionMode

config = QuantizerConfig(execution_mode=ExecutionMode.GRAPH)
model: torch.nn.Module = MyModel()
quantizer = Quantizer(model, config)
# prepare will internally export the model to a graph using torch.export
# prepared_model is a fx.GraphModule
prepared_model: fx.GraphModule = quantizer.prepare(example_inputs)
# Eager mode (same imports as graph above)
config = QuantizerConfig(execution_mode=ExecutionMode.EAGER)
model: torch.nn.Module = MyModel()
quantizer = Quantizer(model, config)
# prepared_model is an nn.Module
prepared_model: nn.Module = quantizer.prepare(example_inputs)

Choosing between graph and eager mode

Weights-only quantization

The two modes are expected to produce very similar models for weight-only quantization. The only difference is that in graph mode, Batchnorm weights are folded with convolution weights such that quantization happens on the fused weights, whereas they are kept separate in the eager mode. This typically should not affect accuracy/latency by much, hence eager mode can be preferred over graph mode.

A few scenarios where eager mode may need to be used instead of graph:

  • If you run into any errors during the prepare call which, under the hood, invokes the torch.export.export and torchao’s prepare_qat_pt2e/convert_pt2e APIs.

  • When torch.nn.Module needs to be provided as an input, instead of ExportedProgram to the conversion API of coreai-torch. This happens when the coreai-torch conversion needs to “externalize” certain sub-modules to map them to composite ops for better runtime performance.

Weights and activations quantization

The two modes show larger differences when activation quantization is enabled. In general, the support for node patterns and proper placement of activation fake-quantize nodes is currently much more mature for graph mode compared to eager mode. Eager mode may yield models with sub-optimal runtime performance. Thus, unless graph mode runs into any errors, it is the recommended option when quantizing activations. See below for details on the differences.

Detailed comparison

Feature

Graph mode (default)

Eager mode

Input → output types

nn.Modulefx.GraphModule

nn.Modulenn.Module

Dynamic control flow

Follows torch.export.export support

Supported

Conv + BatchNorm weight quantization

BN is folded into the preceding Conv weight first

Conv weight unfused

Consecutive fake-quantize duplication

Deduplicates (out fq fq inpout fq inp)

duplication persists

Pattern-based fusion boundaries

Supported (e.g. Conv-BN-ReLU is treated as one block)

Not supported

Shared quantizer for value-preserving ops

Supported (maxpool, avgpool, flatten, concat, …)

Not supported

Config op names

aten op names

__torch_function__ call sites

  • Pattern-based fusion boundaries. Graph mode inserts activation quantizers at the boundaries of op groups that get fused during inference — for a Conv-BN-ReLU pattern, a single pair of input/output quantizers is placed around the whole block, so no quantizer sits inside the fused region.

  • Shared quantizer for value-preserving ops. For ops that do not change the value range (max/avg pool, flatten, concat, etc.), the same quantizer is shared across the input and output so a single set of qparams is used on both sides.

The registries that drive these behaviors live in _graph/_annotation_pattern_registry.py for graph mode and _eager/supported_ops_registry.py for eager mode.

Pattern matching and op naming

The name strings of modules and ops used for granular control in QuantizerConfig differ between the two execution modes. See Quantization Config for details and How to get names + types for modules and ops for how to retrieve them.