Compressing Core AI Models

The Core AI model compression path accepts a Core AI AIProgram as input and produces a compressed AIProgram whose weights are represented in 8 bits or fewer. Compression passes operate directly on the Core AI graph and have no dependency on PyTorch.

Core AI Model weight compression flow

The example below shows the typical flow: load an uncompressed .aimodel from disk, compress its weights via one of the Core AI model compression passes, and save the compressed .aimodel back to disk.

To convert a PyTorch model into an .aimodel, see the coreai-torch documentation. To ensure the model uses 16-bit precision before compressing, see 16-bit PyTorch Model Casting.

from pathlib import Path

from coreai.authoring import AIModelAsset
from coreai_opt.coreai_utils import DType, quantize_weights

# Load an uncompressed aimodel from disk
ai_asset = AIModelAsset.load(Path("model.aimodel"))
ai_program = ai_asset.program

# Compress weights to INT8
compressed_program = quantize_weights(
    coreai_program=ai_program,
    dtype=DType.INT8,
    in_place=False,
)

# Optimize the compressed program, then save it to disk as an .aimodel
compressed_program.optimize()
compressed_program.save_asset(Path("model_compressed.aimodel"))
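As a rough guide to what these bit widths mean for file size, the sketch below estimates weight storage for a single layer. It is back-of-the-envelope arithmetic, not a description of the .aimodel container: the layer shape, the per-tensor fp16 scale, the 16-entry fp16 LUT, and the helper name packed_bits are all assumptions made for this example.

```python
def packed_bits(num_weights, bits_per_weight, overhead_bits=0):
    """Bits for the weight payload plus side data such as scales or a LUT."""
    return num_weights * bits_per_weight + overhead_bits


# Hypothetical 4096 x 4096 linear layer stored in fp16.
num_weights = 4096 * 4096
fp16_bits = packed_bits(num_weights, 16)

# INT8 weights plus one fp16 scale (assumes a single per-tensor scale).
int8_bits = packed_bits(num_weights, 8, overhead_bits=16)

# 4-bit palettized indices plus a 2**4-entry LUT of fp16 values.
pal4_bits = packed_bits(num_weights, 4, overhead_bits=(2 ** 4) * 16)

int8_ratio = fp16_bits / int8_bits
pal4_ratio = fp16_bits / pal4_bits
print(f"int8 ratio: {int8_ratio:.2f}x")
print(f"4-bit palettized ratio: {pal4_ratio:.2f}x")
```

For a layer of this size the side data is negligible, so the ratios land just under 2x and 4x. Finer granularities (per-channel, per-block) add more scales or LUTs and lower the ratio slightly.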

Core AI Model weight palettization

palettize_weights compresses weights in a Core AI MLIR program using K-means palettization. It walks the IR and palettizes each eligible coreai.constant op.

from coreai_opt.coreai_utils import CompressionGranularity, palettize_weights

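# --- palettize weights ---
# coreai_program (here and in the later snippets) is an AIProgram,
# e.g. ai_asset.program from the flow example above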
compressed_program = palettize_weights(
    coreai_program=coreai_program,
    n_bits=4,  # LUT has 2**n_bits entries
    lut_dtype=None,  # keep LUT entries in fp16
    granularity=CompressionGranularity.PER_CHANNEL,
    in_place=False,
)

For finer control, the parameters can be specified explicitly:

from coreai_opt.coreai_utils import CompressionGranularity, DType, palettize_weights

# --- palettize weights with advanced options ---
compressed_program = palettize_weights(
    coreai_program=coreai_program,
    n_bits=8,  # 256-entry LUT (2**8)
    lut_dtype=DType.INT8,  # quantize LUT entries to int8
    granularity=CompressionGranularity.PER_GROUPED_CHANNEL,  # one LUT per group of channels
    group_size=16,  # group size for PER_GROUPED_CHANNEL
    enable_per_channel_scale=True,  # normalize weights by per-channel scale before clustering
    weight_num_threshold=2048,  # skip tensors with <= 2048 elements
    num_kmeans_workers=2,  # parallel workers for k-means
    enable_fast_kmeans_mode=True,  # round weights before clustering to speed up k-means
    rounding_precision=3,  # decimal places for fast k-means rounding
    in_place=True,  # modify coreai_program in-place
)
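To make the lookup-table idea concrete, here is a minimal pure-Python sketch of scalar k-means palettization over a flat list of weights. It is an illustration only and not the coreai_opt implementation: the helper names (nearest_entry, palettize, depalettize) and the evenly spaced centroid initialization are assumptions made for this example, and it builds one LUT for the whole tensor rather than one per channel or per group.

```python
def nearest_entry(lut, value):
    """Index of the LUT entry closest to value."""
    return min(range(len(lut)), key=lambda j: abs(value - lut[j]))


def palettize(weights, n_bits, iterations=25):
    """Return (lut, indices) with 2**n_bits LUT entries via Lloyd's k-means."""
    if not weights:
        raise ValueError("weights must not be empty")
    k = 2 ** n_bits
    ordered = sorted(weights)
    # Deterministic init (an assumption): spread centroids over the sorted values.
    lut = [ordered[(i * (len(ordered) - 1)) // max(k - 1, 1)] for i in range(k)]
    for _ in range(iterations):
        # Assignment step: map every weight to its nearest centroid.
        indices = [nearest_entry(lut, w) for w in weights]
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [w for w, i in zip(weights, indices) if i == j]
            if members:
                lut[j] = sum(members) / len(members)
    indices = [nearest_entry(lut, w) for w in weights]
    return lut, indices


def depalettize(lut, indices):
    """Reconstruct approximate weights from the LUT and the stored indices."""
    return [lut[i] for i in indices]


weights = [0.11, 0.09, 0.10, -0.52, -0.48, -0.50, 0.91, 0.89, 0.30, 0.31]
lut, indices = palettize(weights, n_bits=2)  # 4-entry LUT
restored = depalettize(lut, indices)
print(lut)
print(indices)
```

Only the n_bits-wide indices and the small LUT need to be stored, which is where the compression comes from. Each restored weight is the centroid of its cluster.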

Core AI Model weight quantization

quantize_weights compresses weights in a Core AI MLIR program by quantizing them to a lower-precision integer or floating-point dtype. It walks the IR and quantizes each eligible coreai.constant op.

from coreai_opt.coreai_utils import DType, quantize_weights

# --- quantize weights ---
compressed_program = quantize_weights(
    coreai_program=coreai_program,
    dtype=DType.INT8,  # quantize weights to int8
    in_place=False,
)

For finer control, the parameters can be specified explicitly:

from coreai_opt.coreai_utils import (
    CompressionGranularity,
    DType,
    QScheme,
    quantize_weights,
)

# --- quantize weights with advanced options ---
compressed_program = quantize_weights(
    coreai_program=coreai_program,
    dtype=DType.FP8_E4M3FN,  # quantize weights to FP8 E4M3FN
    qscheme=QScheme.SYMMETRIC,  # only symmetric is supported for FP8 dtypes
    granularity=CompressionGranularity.PER_BLOCK,  # one scale per block of elements
    block_size=32,  # block size for PER_BLOCK
    weight_num_threshold=2048,  # skip tensors with <= 2048 elements
    scale_dtype=DType.FP8_E8M0FNU,  # store scales in FP8 E8M0FNU format
    in_place=True,  # modify coreai_program in-place
)
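The effect of symmetric quantization and of block-wise scales can be sketched in a few lines of plain Python. This is a simplified illustration, not the coreai_opt implementation: it covers signed integer targets only (FP8 formats are not modeled), works on a flat list of weights, and the function names quantize_symmetric and dequantize are hypothetical.

```python
def quantize_symmetric(weights, n_bits=8, block_size=None):
    """Return (codes, scales), with one scale per block of block_size values.

    block_size=None uses a single scale for the whole tensor.
    """
    qmax = 2 ** (n_bits - 1) - 1  # 127 for int8; the range is symmetric
    block_size = block_size or max(len(weights), 1)
    codes, scales = [], []
    for start in range(0, len(weights), block_size):
        block = weights[start:start + block_size]
        max_abs = max(abs(w) for w in block)
        # Symmetric scheme: zero maps to code 0, so no zero point is stored.
        scale = max_abs / qmax if max_abs > 0 else 1.0
        scales.append(scale)
        codes.extend(max(-qmax, min(qmax, round(w / scale))) for w in block)
    return codes, scales


def dequantize(codes, scales, block_size=None):
    """Map integer codes back to approximate floating-point weights."""
    block_size = block_size or max(len(codes), 1)
    return [c * scales[i // block_size] for i, c in enumerate(codes)]


# The second block holds much larger values than the first.
weights = [0.5, -1.0, 0.25, 0.0, 10.0, -5.0, 2.5, 0.0]
codes, scales = quantize_symmetric(weights, n_bits=8, block_size=4)
restored = dequantize(codes, scales, block_size=4)
print(scales)
print(restored)
```

With a single per-tensor scale, the large values in the second block would set the step size for every weight. Per-block scales let the small values in the first block keep a much finer step, at the cost of storing one extra scale per block.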

Core AI Model weight sparsification

sparsify_weights compresses weights in a Core AI MLIR program by pruning them to a target sparsity level. It walks the IR and sparsifies each eligible coreai.constant op.

from coreai_opt.coreai_utils import sparsify_weights

# --- sparsify weights ---
compressed_program = sparsify_weights(
    coreai_program=coreai_program,
    target_sparsity=0.5,  # set 50% of weights (lowest magnitude) to zero
    in_place=False,
)

For finer control, the parameters can be specified explicitly:

from coreai_opt.coreai_utils import DType, sparsify_weights

# --- sparsify weights with advanced options ---

# Magnitude-based sparsification with joint quantization of non-zero values
compressed_program = sparsify_weights(
    coreai_program=coreai_program,
    target_sparsity=0.5,  # 50% of weights set to zero
    block_size=4,  # block sparsity: prune in blocks of 4 along the output channel axis
    quantize_dtype=DType.INT8,  # quantize non-zero values to int8
    weight_num_threshold=2048,  # skip tensors with <= 2048 elements
    in_place=True,  # modify coreai_program in-place
)

# Sparsification with joint palettization of non-zero values
compressed_program = sparsify_weights(
    coreai_program=coreai_program,
    target_sparsity=0.5,
    palettize_nbits=4,  # palettize non-zero values with a 4-bit (16-entry) LUT
    weight_num_threshold=2048,
    in_place=False,
)
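Magnitude-based pruning itself is simple to state in code. The sketch below shows unstructured and block-wise variants on a flat list of weights. It is an illustration under stated assumptions rather than the coreai_opt implementation: the function names are hypothetical, blocks are runs of consecutive values instead of slices along the output channel axis, and blocks are ranked by their sum of squares.

```python
def sparsify_by_magnitude(weights, target_sparsity):
    """Zero the target_sparsity fraction of weights with the smallest magnitude."""
    if not 0.0 <= target_sparsity <= 1.0:
        raise ValueError("target_sparsity must be in [0, 1]")
    n_prune = int(len(weights) * target_sparsity)
    # Positions ordered by ascending magnitude; ties keep their original order.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = set(order[:n_prune])
    return [0.0 if i in pruned else w for i, w in enumerate(weights)]


def sparsify_blocks(weights, target_sparsity, block_size):
    """Zero whole blocks of consecutive values, lowest-energy blocks first."""
    starts = range(0, len(weights), block_size)
    # Block energy (sum of squares) is the ranking criterion assumed here.
    energy = {s: sum(w * w for w in weights[s:s + block_size]) for s in starts}
    n_prune = int(len(energy) * target_sparsity)
    pruned_starts = set(sorted(energy, key=energy.get)[:n_prune])
    return [
        0.0 if (i - i % block_size) in pruned_starts else w
        for i, w in enumerate(weights)
    ]


weights = [0.9, -0.05, 0.4, 0.01, -0.7, 0.2, -0.03, 0.6]
print(sparsify_by_magnitude(weights, target_sparsity=0.5))
print(sparsify_blocks(weights, target_sparsity=0.5, block_size=2))
```

Unstructured pruning keeps the four largest individual weights, while the block variant removes the two lowest-energy pairs and may discard a large weight that shares a block with small ones. Joint quantization or palettization would then be applied to the surviving non-zero values only.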