Semantic Regexes: Auto-Interpreting LLM Features with a Structured Language
Automated interpretability aims to translate large language model (LLM) features into human-understandable descriptions. However, these natural language feature descriptions are often vague and inconsistent, and they require manual relabeling. In response, we introduce semantic regexes, structured language descriptions of LLM features. By combining primitives that capture linguistic and semantic feature patterns with modifiers for contextualization, composition, …