Semantic Regexes: Auto-Interpreting LLM Features with a Structured Language

Angie Boggust¹*, Donghao Ren², Yannick Assogba², Dominik Moritz², Arvind Satyanarayan¹, Fred Hohman²

¹MIT CSAIL

²Apple

*Work done at Apple

arXiv • October 2025

Semantic regexes are an automated interpretability method that describes LLM features using a structured language. They provide accurate, concise, and consistent feature descriptions that help humans build mental models of feature activations.



Max Acts / Gemma-2-2b / Gemmascope-res-16k

Max Acts / Gemma-2-2b / Gemmascope-res-65k

Max Acts / Gpt2-small / Res-jb

Max Acts / Gpt2-small / Res-jb / Gpt-4o

Token Act Pair / Gemma-2-2b / Gemmascope-res-16k

Token Act Pair / Gemma-2-2b / Gemmascope-res-65k

Token Act Pair / Gpt2-small / Res-jb

Token Act Pair / Gpt2-small / Res-jb / Gpt-4o

Semantic Regex / Gemma-2-2b / Gemmascope-res-16k

Semantic Regex / Gemma-2-2b / Gemmascope-res-65k

Semantic Regex / Gpt2-small / Res-jb

Semantic Regex / Gpt2-small / Res-jb / Gpt-4o