coreai_opt.quantization.Quantizer

class coreai_opt.quantization.Quantizer(model, config=None)

Bases: _BaseQuantizer

Unified quantizer API that provides a single entry point for various quantization workflows, including:

  • Data Types: Integer (e.g. int8, int4) and floating-point (e.g. float8_e4m3fn, float8_e5m2) quantization

  • Quantization Workflows: Post-training quantization (PTQ) and quantization-aware training (QAT)

  • Execution Modes: Graph mode (built on torchao’s PT2E) or eager mode

The quantizer automatically selects the appropriate underlying implementation based on the execution_mode specified in the configuration; the default is graph mode. The key differences between the execution modes are summarized below:

Feature | Graph Mode (Default) | Eager Mode
--- | --- | ---
Input/Output Types | nn.Module -> fx.GraphModule | nn.Module -> nn.Module
Module Fusion | Automatic pattern-based fusion (e.g., conv+bn+relu) | Manual fusion required
Control Flow | Static graph only; requires a torch.export-compatible model | Supports dynamic control flow (if/else, loops)
Shared Observer Ops | Handled correctly; ops like MaxPool that share the same observer across inputs and outputs are detected and deduplicated on the graph. | Not supported; ops like MaxPool have independent observers for input vs. output, which can cause incorrect quantization.
FQ Node Deduplication | Back-to-back fake-quantize nodes on the same tensor are collapsed into a single node, avoiding redundant quantization on intermediate edges. | No deduplication; if both the output of one op and the input of the next are quantized, two consecutive FQ nodes are inserted on that intermediate edge.

Because of these differences, graph mode and eager mode can insert a different total number of fake-quantize nodes for the same QuantizerConfig. The two modes are therefore not guaranteed to produce equivalent quantized models, and final model performance (accuracy and latency) may differ between them even with identical configurations.
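To make the node-count difference concrete, here is a toy illustration in plain Python. It is not coreai_opt code: `count_fq_nodes` and its `dedupe` flag are hypothetical names, and the model assumes a simple linear chain in which every op requests quantization of both its input and its output.

```python
def count_fq_nodes(num_ops: int, dedupe: bool) -> int:
    """Count fake-quantize (FQ) nodes on a linear chain of `num_ops` ops.

    Toy model (an assumption, not the library's algorithm): every op asks
    for an FQ node on its input and on its output. A chain of n ops has
    one model input, n - 1 intermediate edges, and one model output.
    """
    if num_ops < 1:
        raise ValueError("num_ops must be >= 1")
    intermediate_edges = num_ops - 1
    if dedupe:
        # Graph-mode-like behavior: one FQ node per tensor.
        return 2 + intermediate_edges
    # Eager-mode-like behavior: each intermediate edge receives two FQ
    # nodes (the producer's output and the consumer's input).
    return 2 + 2 * intermediate_edges


graph_like = count_fq_nodes(3, dedupe=True)
eager_like = count_fq_nodes(3, dedupe=False)
print(graph_like, eager_like)
```

Under these assumptions the two counts agree only for a single op and diverge as the chain grows, which is one reason identical configurations may not yield identical models.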

Parameters:
  • model (nn.Module) – The PyTorch model to quantize.

  • config (QuantizerConfig | None) – Quantization configuration. If None, a default configuration with int8 weight and activation quantization is created.

Example

>>> from coreai_opt.quantization import Quantizer, QuantizerConfig, ExecutionMode
>>>
>>> # PTQ with calibration (default int8, graph mode)
>>> config = QuantizerConfig()
>>> quantizer = Quantizer(model, config)
>>> prepared_model = quantizer.prepare((example_input,))
>>> with quantizer.calibration_mode():
...     for data in calibration_loader:
...         prepared_model(data)
>>> quantized_model = quantizer.finalize()
>>>
>>> # QAT workflow (default schedule — observers and fake_quant enabled throughout)
>>> prepared_model = quantizer.prepare((example_input,))
>>> with quantizer.training_mode():
...     for epoch in range(num_epochs):
...         for data, target in train_loader:
...             optimizer.zero_grad()
...             output = prepared_model(data)
...             loss = criterion(output, target)
...             loss.backward()
...             optimizer.step()
>>> quantized_model = quantizer.finalize()
>>>
>>> # QAT workflow with schedule
>>> from coreai_opt.quantization import ModuleQuantizerConfig
>>> from coreai_opt.quantization.config import QATSchedule
>>> # Enable observers from the start, enable fake quant at the 100th step,
>>> # and disable observers at the 500th step.
>>> schedule = QATSchedule(
...     enable_observer=0, enable_fake_quant=100, disable_observer=500
... )
>>> config = QuantizerConfig(
...     global_config=ModuleQuantizerConfig(qat_schedule=schedule)
... )
>>> quantizer = Quantizer(model, config)
>>> prepared_model = quantizer.prepare((example_input,))
>>> with quantizer.training_mode():
...     for data, target in train_loader:
...         optimizer.zero_grad()
...         loss = criterion(prepared_model(data), target)
...         loss.backward()
...         optimizer.step()
...         quantizer.step()
>>> quantized_model = quantizer.finalize()
__init__(model, config=None)

Initialize the model compressor.

Parameters:
  • model (Module) – The PyTorch model to compress. The model will be modified in-place during the compression process.

  • config (QuantizerConfig | None) – Configuration parameters for the compression.

Methods

calibration_mode([model])

Context manager for calibration-based post-training quantization.

disable_fake_quant([module])

Disable fake quantization on the model or a specific module.

disable_observer([module])

Disable observers on the model or a specific module.

enable_fake_quant([module])

Enable fake quantization on the model or a specific module.

enable_observer([module])

Enable observers on the model or a specific module.

finalize([model, backend, mmap_dir])

Convert quantized model to backend-specific representations.

get_compressible_op_names(model, execution_mode)

Return op names in model that this quantizer can target.

prepare(example_inputs[, dynamic_shapes, ...])

Prepare the model for quantization by inserting fake quantization modules.

step()

Advance the QAT schedule by one step and apply observer/fake_quant transitions after the step has been incremented.

supported_modules()

Returns the types of modules that are supported for compression with a particular model optimization technique.

training_mode([model])

Context manager for quantization-aware training (QAT) workflow.

calibration_mode(model=None)

Context manager for calibration-based post-training quantization.

When entering this context, observers are enabled to collect statistics from calibration data, and fake quantization is disabled to get accurate statistics. When exiting, observers are disabled and fake quantization is re-enabled for evaluation.
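The observer behavior described above can be pictured with a small stand-alone sketch. `MinMaxObserver` below is a hypothetical min/max observer written for illustration; the observers actually used by the library, and the way they derive quantization parameters, may differ.

```python
class MinMaxObserver:
    """Hypothetical observer: tracks the running min/max of observed values."""

    def __init__(self) -> None:
        self.min_val = float("inf")
        self.max_val = float("-inf")
        self.enabled = True

    def observe(self, values: list[float]) -> None:
        # Statistics are only updated while the observer is enabled.
        if self.enabled and values:
            self.min_val = min(self.min_val, min(values))
            self.max_val = max(self.max_val, max(values))

    def qparams(self, qmin: int = -128, qmax: int = 127) -> tuple[float, int]:
        """Derive an affine (scale, zero_point) pair for the observed range."""
        # Include 0.0 in the range so that zero is exactly representable.
        lo = min(self.min_val, 0.0)
        hi = max(self.max_val, 0.0)
        scale = (hi - lo) / (qmax - qmin) or 1.0  # avoid a zero scale
        zero_point = round(qmin - lo / scale)
        return scale, max(qmin, min(qmax, zero_point))


observer = MinMaxObserver()
for batch in ([0.0, 1.5, 3.0], [-1.0, 2.0], [0.5, 4.1]):
    observer.observe(batch)   # calibration: collect statistics only
observer.enabled = False      # leaving calibration freezes the range
observer.observe([100.0])     # ignored, because the observer is disabled
scale, zero_point = observer.qparams()
```

The sketch shows why representative calibration data matters: the scale and zero point are fixed by whatever range the observer saw while it was enabled.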

When to use:

  • Required for activation quantization to achieve good accuracy. The model returned by prepare() may have poor accuracy with activation quantization until it is calibrated with representative data.

  • Not needed for weight-only PTQ (prepare() → finalize() is sufficient)
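One way to see why weight-only PTQ can skip calibration: weight ranges can be read directly from the weights themselves, with no input data. The sketch below assumes symmetric per-tensor int8 quantization; the function names are hypothetical, and the scheme the library actually applies depends on its configuration.

```python
def symmetric_weight_scale(weights: list[float], qmax: int = 127) -> float:
    """Scale for symmetric quantization, derived from the weights alone."""
    max_abs = max((abs(w) for w in weights), default=0.0)
    return max_abs / qmax if max_abs > 0 else 1.0


def quantize_weights(weights: list[float], scale: float, qmax: int = 127) -> list[int]:
    """Map each weight to the nearest integer level, clamped to [-qmax, qmax]."""
    return [max(-qmax, min(qmax, round(w / scale))) for w in weights]


weights = [0.5, -1.27, 0.0, 0.64]
scale = symmetric_weight_scale(weights)      # no calibration data needed
q_weights = quantize_weights(weights, scale)
print(q_weights)
```

Activations, by contrast, only exist once data flows through the model, which is why their ranges have to be observed during calibration.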

Parameters:

model (Module | GraphModule | None) – Optional model to set up for calibration. If None, uses the internal prepared model.

Example

>>> quantizer = Quantizer(model, config)
>>> prepared_model = quantizer.prepare(example_inputs)
>>> # For activation quantization, calibrate to improve accuracy:
>>> with quantizer.calibration_mode():
...     for batch in calibration_dataloader:
...         prepared_model(batch)
>>> finalized_model = quantizer.finalize()
Raises:

RuntimeError – If the model has not been prepared.

disable_fake_quant(module=None)

Disable fake quantization on the model or a specific module.

Parameters:

module (Module | None)

Return type:

None

disable_observer(module=None)

Disable observers on the model or a specific module.

Parameters:

module (Module | None)

Return type:

None

enable_fake_quant(module=None)

Enable fake quantization on the model or a specific module.

Parameters:

module (Module | None)

Return type:

None

enable_observer(module=None)

Enable observers on the model or a specific module.

Parameters:

module (Module | None)

Return type:

None

finalize(model=None, backend=ExportBackend.CoreAI, *, mmap_dir=None)

Convert quantized model to backend-specific representations.

Converts fake quantization modules into backend-specific quantization ops. Only call finalize when exporting to a target backend. For torch-based evaluation, use the model returned by prepare() directly rather than calling finalize.

Backend-specific processing:

  • CoreAI: Prepares for CoreAI export by replacing fake quantization modules with CoreAI-specific PyTorch custom ops.

  • CoreML: Prepares for CoreML export by registering compression metadata as buffers and removing fake quantization modules.

Parameters:
  • model (Module | GraphModule | None) – Optional model to finalize. If None, uses the internal prepared model.

  • backend (ExportBackend) – Target export backend for the quantized model. Supports CoreAI (default), CoreML, and _TORCH backends.

  • mmap_dir (str | None) – If provided, serialize finalized quantized weights to safetensors files under this directory and re-load them via mmap. Only supported in eager execution mode with the CoreAI backend; raises ValueError otherwise. The files in mmap_dir must remain in place for the lifetime of the returned model; removing them invalidates the mmap-backed weights.

Returns:

The finalized quantized model ready for deployment on the target backend.

Return type:

Module | GraphModule

Note

In graph mode, the returned fx.GraphModule supports calling .train() and .eval(), but with limited effect: only dropout and batchnorm ops are affected via FX graph rewriting. User code branching on the training flag and other ops with mode-dependent behavior are not affected.

Note

When backend=ExportBackend.CoreAI in execution_mode=ExecutionMode.EAGER, finalize frees the original dense weights.

classmethod get_compressible_op_names(model, execution_mode)

Return op names in model that this quantizer can target.

Dispatches to the appropriate underlying quantizer based on execution_mode.

Parameters:
  • model (nn.Module) – The model to get compressible op names for.

  • execution_mode (ExecutionMode) – The execution mode.

Returns:

Op names that can be compressed via quantization.

Return type:

set[str]

prepare(example_inputs, dynamic_shapes=None, export_with_no_grad=True)

Prepare the model for quantization by inserting fake quantization modules.

Graph Mode: Exports the model using torch.export, applies quantization annotations, and sets up fake quantization modules. Returns an fx.GraphModule.

Eager Mode: Uses __torch_function__ to trace model execution and insert fake quantizers during the forward pass. Returns an nn.Module.

Important Notes:

  • For weight-only PTQ: The prepared model can be directly finalized (prepare() → finalize() workflow).

  • For activation quantization: The prepared model should be calibrated using calibration_mode() before finalization to collect statistics and achieve good accuracy.
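For readers new to the term, a fake quantization module typically quantizes a value and immediately dequantizes it, so the model keeps computing in floating point while experiencing quantization error. The function below is a minimal sketch of that idea for a single scalar; `fake_quantize` is a hypothetical helper, not the module that prepare() inserts.

```python
def fake_quantize(x: float, scale: float, zero_point: int,
                  qmin: int = -128, qmax: int = 127) -> float:
    """Quantize `x` onto an integer grid, then immediately dequantize it."""
    q = round(x / scale) + zero_point
    q = max(qmin, min(qmax, q))          # clamp to the representable range
    return (q - zero_point) * scale      # back to floating point


scale, zero_point = 0.1, 0
rounded = fake_quantize(0.26, scale, zero_point)    # snaps to the nearest level
saturated = fake_quantize(50.0, scale, zero_point)  # clamped at qmax * scale
print(rounded, saturated)
```

The quality of the result depends on `scale` and `zero_point`, which is why activation quantization benefits from calibration before finalization.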

Parameters:
  • example_inputs (tuple[Any, ...]) – Tuple of example inputs for model tracing. When activation quantization is in use, these should be representative of the data the model would typically see.

  • dynamic_shapes (dict[str, Any] | tuple[Any] | list[Any] | None) – Dynamic shapes specification (graph mode only). Ignored in EAGER mode.

  • export_with_no_grad (bool) – Whether to export with no_grad (graph mode only). Ignored in EAGER mode.

Returns:

The prepared model with fake quantization modules inserted, ready for calibration or training. This is a data-free PTQ compressed model.

Return type:

Module | GraphModule

Note

In graph mode, the returned fx.GraphModule supports calling .train() and .eval(), but with limited effect: only dropout and batchnorm ops are affected via FX graph rewriting. User code branching on the training flag and other ops with mode-dependent behavior are not affected.

step()

Advance the QAT schedule by one step and apply observer/fake_quant transitions after the step has been incremented.

Must be called inside a training_mode() context. Increments _step_count (monotonically; never reset between training loops), then applies the absolute observer/fake_quant state corresponding to the new step count.
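The "increment, then apply the absolute state" behavior can be sketched without the library. `ToySchedule` and `state_at` are hypothetical stand-ins; in particular, treating each threshold as inclusive (`>=`) is an assumption of this sketch rather than documented behavior.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToySchedule:
    """Hypothetical stand-in for a QAT schedule (absolute step counts)."""
    enable_observer: int = 0
    enable_fake_quant: int = 0
    disable_observer: Optional[int] = None


def state_at(schedule: ToySchedule, step_count: int) -> tuple[bool, bool]:
    """Return (observer_enabled, fake_quant_enabled) for an absolute step."""
    observer_on = step_count >= schedule.enable_observer and (
        schedule.disable_observer is None
        or step_count < schedule.disable_observer
    )
    fake_quant_on = step_count >= schedule.enable_fake_quant
    return observer_on, fake_quant_on


schedule = ToySchedule(enable_observer=0, enable_fake_quant=100, disable_observer=500)
step_count = 0
for _ in range(500):
    step_count += 1                          # increment first ...
    state = state_at(schedule, step_count)   # ... then apply the new state
```

Because the state is a pure function of the absolute step count, it can be recomputed at any point, for example when a training context is re-entered.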

Raises:

RuntimeError – If called outside a training_mode() context.

Warns:

UserWarning – If no qat_schedule is configured on any module.

Return type:

None

training_mode(model=None)

Context manager for quantization-aware training (QAT) workflow.

Entering this context configures the model for training. By default, both observers and fake quantization are enabled; if a QATSchedule is configured, their state is instead determined by the current step count. On entry, the context:

  1. Puts the model in training mode (model.training is set to True).

  2. Enables the observers and fake quantization, or applies the state required by the schedule.

  3. Simulates quantization during the forward and backward passes, using the statistics collected by the observers.

When exiting the context, observers are disabled and fake quantization is enabled (regardless of schedule).

The step count is not reset when re-entering training_mode() — it resumes from the last value, so schedule state is restored from the accumulated count.

Nested calls to training_mode() are not allowed and will raise a RuntimeError.
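The two rules above (no nesting, and a step count that survives re-entry) follow a common context-manager pattern. The sketch below is a plain-Python illustration with a hypothetical `ToyTrainer` class; it is not the library's implementation.

```python
from contextlib import contextmanager


class ToyTrainer:
    """Hypothetical class showing a non-reentrant training context."""

    def __init__(self) -> None:
        self.step_count = 0
        self._in_training = False

    @contextmanager
    def training_mode(self):
        if self._in_training:
            raise RuntimeError("training_mode() cannot be nested")
        self._in_training = True
        try:
            yield self
        finally:
            # Only the flag is cleared; step_count is deliberately kept.
            self._in_training = False

    def step(self) -> None:
        if not self._in_training:
            raise RuntimeError("step() must be called inside training_mode()")
        self.step_count += 1


trainer = ToyTrainer()
with trainer.training_mode():
    for _ in range(3):
        trainer.step()
with trainer.training_mode():   # re-entry resumes from step 3
    trainer.step()
```

Keeping the counter outside the context's lifetime is what lets a schedule pick up where it left off across separate training loops.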

When to use:

  • For quantization-aware training (QAT) to fine-tune a prepared model

  • The prepared model from prepare() may have poor accuracy for weight-only quantization. Fine-tuning the model with the quantization enabled will help the weights adapt to the effects of quantization.

  • When calibrating an activation-quantized model does not improve accuracy enough. Fine-tuning the weights to adapt to the effect of activation (and weight) quantization can help recover the lost accuracy.

Parameters:

model (Module | GraphModule | None) – Optional model to set up for training. If None, uses the internal prepared model.

Example

>>> quantizer = Quantizer(model, config)
>>> prepared_model = quantizer.prepare(example_inputs)
>>> # Fine-tune with quantization-aware training:
>>> with quantizer.training_mode():
...     # Model is put in training mode
...     for epoch in range(num_epochs):
...         for inputs, targets in train_dataloader:
...             # Perform training step
...             optimizer.zero_grad()
...             loss = loss_fn(prepared_model(inputs), targets)
...             loss.backward()
...             optimizer.step()
...             quantizer.step()
...
>>> finalized_model = quantizer.finalize()
Raises:
  • RuntimeError – If the model has not been prepared.

  • RuntimeError – If called while already inside a training_mode() context.

  • TypeError – If the provided model is not a torch.fx.GraphModule (graph mode).
