Debugging¶
Find NaN/infinity issues, compare model implementations, and analyze structural differences with the debugging module.
Note
During the current preview, set the following environment variables to ensure operation-level debug metadata is preserved and available to these tools:
export USE_LOCAL_COREAI=1
export ENABLE_DEBUG_INFO=1
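Since the tools on this page rely on that metadata, it can help to confirm from Python that both variables are set before running them. A minimal sketch, assuming the variables only need the value `1` as shown above; `missing_debug_env` is a hypothetical helper and not part of the debugging module:

```python
import os

# Variable names and values are taken from the note above.
REQUIRED_DEBUG_ENV = {"USE_LOCAL_COREAI": "1", "ENABLE_DEBUG_INFO": "1"}


def missing_debug_env(env=None):
    """Return the names of required variables that are unset or not '1'.

    Hypothetical helper for illustration; not part of coreai_torch.
    """
    env = os.environ if env is None else env
    return [name for name, value in REQUIRED_DEBUG_ENV.items() if env.get(name) != value]


if __name__ == "__main__":
    missing = missing_debug_env()
    if missing:
        print(f"Set these before debugging: {', '.join(missing)}")
```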
Quick start¶
import torch
from coreai_torch.debugging.validator import create_validator_for_exported_program
# Find NaN/inf issues in PyTorch models
model = MyModel().eval()
exported = torch.export.export(model, args=(torch.randn(1, 10),))
validator = create_validator_for_exported_program(exported)
result = await validator.check_for_nans(inputs=(torch.randn(1, 10),))
if result.failed_nodes:
    print(f"NaN detected at: {result.failed_nodes[0]}")
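The snippets on this page use `await` at the top level, which works in a notebook or an `asyncio` REPL. In a plain script, the awaited calls need to run inside a coroutine. A minimal sketch of that wrapping, using a stand-in coroutine `fake_check_for_nans` (hypothetical) in place of the real validator call:

```python
import asyncio


async def fake_check_for_nans(inputs):
    """Stand-in for an awaitable validator call; hypothetical, for illustration only."""
    await asyncio.sleep(0)  # yield to the event loop, as a real async call would
    return {"failed_nodes": []}


async def main():
    # In real code this line would be:
    #   result = await validator.check_for_nans(inputs=...)
    result = await fake_check_for_nans(inputs=(1.0, 2.0))
    return result


if __name__ == "__main__":
    print(asyncio.run(main()))
```

`asyncio.run` creates the event loop, runs `main()` to completion, and closes the loop, so the rest of the script can stay synchronous.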
Finding NaN/infinity issues¶
Use when: Your model produces NaN or infinity values and you need to find which operation caused the issue.
PyTorch models¶
from coreai_torch.debugging.validator import create_validator_for_exported_program
# Export your model
exported_program = torch.export.export(model, args=example_input)
# Create validator
validator = create_validator_for_exported_program(exported_program)
# Check for numerical issues
nan_result = await validator.check_for_nans(inputs=example_input)
inf_result = await validator.check_for_infs(inputs=example_input)
# Get first failing operation
if nan_result.failed_nodes:
    print(f"First NaN at: {nan_result.failed_nodes[0]}")
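Conceptually, locating the first failing operation means inspecting intermediate values in execution order and stopping at the first one that is NaN. The sketch below illustrates the idea on a toy chain of scalar operations. It is not how the validator is implemented, and `first_nan_op` and the op names are hypothetical:

```python
import math


def first_nan_op(ops, x):
    """Apply (name, fn) pairs in order; return the name of the first op whose output is NaN."""
    for name, fn in ops:
        x = fn(x)
        if math.isnan(x):
            return name
    return None


# Toy pipeline: for large inputs, "scale" overflows to inf and "sub" turns inf into NaN.
ops = [
    ("scale", lambda v: v * 1e308),
    ("sub", lambda v: v - v),  # inf - inf is NaN
    ("relu", lambda v: max(v, 0.0)),
]

print(first_nan_op(ops, 10.0))  # "sub"
print(first_nan_op(ops, 1.0))   # None
```

Note that the overflow in `scale` yields infinity rather than NaN, which is why running both a NaN check and an infinity check can point at different operations.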
Core AI programs¶
from coreai_torch.debugging.validator import create_validator_for_coreai_program
# Convert to Core AI
converter = TorchConverter().add_exported_program(exported_program)
coreai_program = converter.to_coreai()
coreai_program.optimize()
# Create validator
validator = await create_validator_for_coreai_program(coreai_program, "main")
# Check for issues
result = await validator.check_for_nans(inputs={"x": torch.randn(2, 4)})
Comparing model implementations¶
Use when: You need to verify that PyTorch and Core AI models produce the same outputs after conversion.
Cross-framework comparison¶
Compare PyTorch vs Core AI to verify conversion correctness:
from coreai_torch.debugging.comparator import create_comparator_for_programs
# Create comparator between PyTorch and Core AI
comparator = await create_comparator_for_programs(
    source_program=exported_program,
    target_program=coreai_program,
    target_entry_point="main",
)
# Compare outputs with tolerance
result = await comparator.compare_with_tolerance(
    inputs={"x": example_input},
    rtol=1e-5,
    atol=1e-8,
)
# Check for differences
if result.failed_nodes:
    for source_op, target_op in result.failed_nodes:
        print(f"Mismatch: {source_op} vs {target_op}")
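The `rtol` and `atol` arguments are assumed here to follow the NumPy-style closeness rule, where two values match when |actual - expected| <= atol + rtol * |expected|. This page does not state the exact rule the comparator uses, so treat the following pure-Python sketch as an illustration of that assumption; `is_close` and `mismatched_indices` are hypothetical names:

```python
def is_close(actual, expected, rtol=1e-5, atol=1e-8):
    """NumPy-style closeness test: |actual - expected| <= atol + rtol * |expected|."""
    return abs(actual - expected) <= atol + rtol * abs(expected)


def mismatched_indices(actual, expected, rtol=1e-5, atol=1e-8):
    """Return indices where two equal-length sequences differ beyond tolerance."""
    return [
        i
        for i, (a, e) in enumerate(zip(actual, expected))
        if not is_close(a, e, rtol, atol)
    ]
```

Under this rule `atol` dominates near zero and `rtol` dominates for large magnitudes, so a very small `atol` such as `1e-8` can flag differences between values that are both effectively zero.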
Core AI inspector¶
Use when: You need to examine intermediate values from specific operations in a deployed Core AI model.
Capture intermediate values from deployed Core AI models:
from pathlib import Path
import numpy as np
from coreai_torch.debugging.inspector import CoreAIInspector
from coreai.runtime import AIModel
# Load deployed Core AI model
asset_path = Path("my_model.aimodel")
ai_model = await AIModel.load(asset_path)
# Create inspector
inspector = CoreAIInspector(model=ai_model, function_name="main")
# Get operation IDs to inspect (from debug info)
coreai_op_ids = [1, 5, 10, 15]
# Capture intermediate values
results = await inspector.get_intermediates_for_ops(
    coreai_op_ids,
    inputs={"x": np.random.randn(2, 4).astype(np.float32)},
)
# Check results
for op_id, outputs in results.items():
    print(f"Op {op_id}: {len(outputs) if outputs else 0} outputs")
Structural graph analysis¶
Use when: You want to understand how model structure changes between different versions or after optimization passes.
Graph difference analysis¶
Analyze structural differences between model implementations using graph isomorphism:
from coreai_torch.debugging.graph_diff import (
    compute_exported_program_diff,
    compute_coreai_program_diff,
    write_diff,
)
# Compare two PyTorch programs
source_program = torch.export.export(model_v1, example_input)
target_program = torch.export.export(model_v2, example_input)
diff = compute_exported_program_diff(source_program, target_program)
# Check structural compatibility
if diff.is_isomorphic:
    print("✓ Graphs have identical structure")
else:
    print(f"✗ {diff.summary.unmapped_source_node_count} source nodes have no match in the target graph")
# Write detailed diff report to stdout
write_diff(
    diff,
    diff.source_graph,
    diff.target_graph,
    max_items=20,
)
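Two isomorphic graphs necessarily contain the same operations the same number of times, so comparing op counts is a cheap pre-check before reading a full diff. The sketch below shows that necessary (but not sufficient) condition on plain lists of op names. It does not use the graph_diff module, and `op_count_diff` is a hypothetical helper:

```python
from collections import Counter


def op_count_diff(source_ops, target_ops):
    """Return {op: (source_count, target_count)} for ops whose counts differ."""
    src, tgt = Counter(source_ops), Counter(target_ops)
    return {
        op: (src[op], tgt[op])
        for op in sorted(set(src) | set(tgt))
        if src[op] != tgt[op]
    }


# Hypothetical op names for two model versions.
v1_ops = ["conv", "relu", "conv", "linear"]
v2_ops = ["conv", "relu", "linear", "gelu"]
print(op_count_diff(v1_ops, v2_ops))  # {'conv': (2, 1), 'gelu': (0, 1)}
```

An empty result does not prove the graphs are isomorphic, since the same operations can be wired together differently; it only rules out the cheapest kind of mismatch.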
Performance profiling¶
Use when: You need to identify slow operations and performance bottlenecks in your Core AI model.
Profile operation timing in Core AI programs:
import sys
from coreai_torch.debugging.benchmarker import benchmark_coreai_program
# Run benchmark
result = await benchmark_coreai_program(
    coreai_program=coreai_program,
    inputs={"x": torch.randn(2, 4)},
    num_runs=50,
)
# Show timing summary
result.write_summary(sys.stdout)
# Get module-level timing
module_timings = result.get_module_timings()
for name, module in module_timings.items():
    print(f"{name}: {module.aggregated_op_stats.average:.3f}ms avg")
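To illustrate why `num_runs` matters, here is a generic timing loop that averages repeated calls and reports the spread. It is a sketch of the general technique only; whether `benchmark_coreai_program` uses warmup runs or these statistics is not stated on this page, and `time_callable` is a hypothetical helper:

```python
import statistics
import time


def time_callable(fn, num_runs=50, warmup=5):
    """Return (mean_ms, stdev_ms) over num_runs timed calls after a few warmup calls.

    num_runs must be at least 2 so that a standard deviation can be computed.
    """
    for _ in range(warmup):
        fn()  # untimed calls to absorb one-off setup costs
    samples_ms = []
    for _ in range(num_runs):
        start = time.perf_counter()
        fn()
        samples_ms.append((time.perf_counter() - start) * 1000.0)
    return statistics.mean(samples_ms), statistics.stdev(samples_ms)


if __name__ == "__main__":
    mean_ms, stdev_ms = time_callable(lambda: sum(range(10_000)))
    print(f"{mean_ms:.3f}ms avg, {stdev_ms:.3f}ms stdev")
```

A large standard deviation relative to the mean suggests that more runs, or a quieter machine, are needed before comparing operations.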
Custom validation¶
Use when: You need to check for specific conditions beyond NaN/infinity (e.g., value ranges, specific patterns).
Define a function that takes the operation outputs and returns True when the condition you are looking for is present:
def check_large_values(outputs):
    """Return True if any output has an absolute value above 1000.0."""
    return any(
        arr is not None and abs(arr).max() > 1000.0
        for arr in outputs
    )
# Use custom check
result = await validator.check(check_large_values, inputs=example_input)
Configuration¶
Search strategies¶
Choose how to search through operations:
from coreai_torch.debugging.search_strategy import LevelOrderStrategy
# Binary search (default - fastest for finding first issue)
strategy = LevelOrderStrategy.bisection(graph, batch_size=10)
# Top-down (systematic from inputs to outputs)
strategy = LevelOrderStrategy.top_down(graph)
# Adaptive (automatically selects best approach)
strategy = LevelOrderStrategy.auto(graph)
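Bisection is fast because a numerical failure usually propagates: once one operation produces a bad value, the operations that depend on it tend to produce bad values too. Under that monotonicity assumption, the first failing position among `n` ordered operations can be found in about log2(n) checks. The following is a generic binary-search sketch of that idea, not the library's strategy implementation; `first_failing_index` and `fails_at` are hypothetical names:

```python
def first_failing_index(num_ops, fails_at):
    """Binary search for the smallest index i where fails_at(i) is True.

    Assumes monotonicity: if the output at index i is bad, so is every later one.
    Returns None when no index fails.
    """
    # The first failing index lies in [lo, hi]; hi == num_ops means "none found".
    lo, hi = 0, num_ops
    while lo < hi:
        mid = (lo + hi) // 2
        if fails_at(mid):
            hi = mid       # failure is at mid or earlier
        else:
            lo = mid + 1   # failure, if any, is after mid
    return lo if lo < num_ops else None
```

For 100 operations this needs at most 7 checks, whereas a top-down scan may need up to 100. The trade-off is that bisection can be misled when the failure does not propagate to later operations.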
Batch size¶
# Control batch size for memory efficiency
strategy = LevelOrderStrategy.bisection(graph, batch_size=5) # Smaller batches
strategy = LevelOrderStrategy.bisection(graph, batch_size=20) # Larger batches
validator = create_validator_for_exported_program(exported)
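This page does not define `batch_size` precisely. Assuming it bounds how many operations have their outputs captured in a single run, the trade-off is between memory per run and the number of runs. The sketch below shows that chunking on a plain list; `batches` is a hypothetical helper and not part of the search_strategy module:

```python
def batches(items, batch_size):
    """Yield consecutive slices of items with at most batch_size elements each."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


op_ids = list(range(12))  # hypothetical operation IDs
print(len(list(batches(op_ids, 5))))   # 3 runs, at most 5 captured outputs each
print(len(list(batches(op_ids, 20))))  # 1 run, all 12 outputs captured at once
```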
Torch utilities¶
Use when: You need to save intermediate values to disk for later analysis or share debug data.
Saving intermediate values¶
Save all intermediate tensor values from PyTorch model execution:
from coreai_torch.debugging.torch_utils import save_intermediates, load_intermediates
from pathlib import Path
# Export your PyTorch model
exported_program = torch.export.export(model, args=example_input)
# Save intermediate values to disk
metadata_path = save_intermediates(
    program=exported_program,
    inputs=example_input,
    output_dir=Path("./debug_output"),
)
print(f"Intermediates saved to: {metadata_path}")
Loading intermediate values¶
Load saved intermediate values for analysis:
# Load intermediate values from disk
debug_trace = load_intermediates(Path("./debug_output/main.aimodelintermediates"))
# Access saved values
print(f"Inputs: {list(debug_trace.inputs.keys())}")
print(f"Outputs: {list(debug_trace.outputs.keys())}")
print(f"Intermediates: {len(debug_trace.intermediates)} operations")
# Analyze specific intermediate values
for node_name, tensor in debug_trace.intermediates.items():
    print(f"{node_name}: shape {tensor.shape}, mean {tensor.mean():.3f}")
Custom value filtering¶
Filter which intermediate values to save:
def custom_filter(node, result):
    """Only save convolution, linear, and matmul outputs"""
    return any(op in str(node.target).lower() for op in ["conv", "linear", "matmul"])
# Save only filtered operations
metadata_path = save_intermediates(
    program=exported_program,
    inputs=example_input,
    output_dir=Path("./debug_output"),
    node_filter=custom_filter,
)
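Because the filter is a plain predicate over `node.target`, it can be tried out on stand-in objects before a full save. The sketch below does that with `types.SimpleNamespace`; the target names are hypothetical examples, and real nodes come from the exported graph:

```python
from types import SimpleNamespace


def custom_filter(node, result):
    """Keep nodes whose target name mentions conv, linear, or matmul."""
    return any(op in str(node.target).lower() for op in ["conv", "linear", "matmul"])


# Stand-in nodes with hypothetical target names.
targets = (
    "aten.conv2d.default",
    "aten.relu.default",
    "aten.linear.default",
    "aten.add.Tensor",
)
nodes = [SimpleNamespace(target=t) for t in targets]
kept = [n.target for n in nodes if custom_filter(n, result=None)]
print(kept)  # ['aten.conv2d.default', 'aten.linear.default']
```

Substring matching is simple but broad: any target whose name contains one of the substrings is kept, so check the filter against the actual target names in your graph.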
The debugging module provides tools for validating model correctness, analyzing structural changes, and identifying performance issues.
See also¶
Conversion Workflows: understand the full pipeline from export to DeployableProgram.
Supported ATen ops: check which ATen ops have built-in lowering rules if you hit a conversion error.
TorchConverter API reference: the main conversion class; to_coreai() produces the program you validate here.