ExpertLens: Activation Steering Features Are Highly Interpretable
This paper was accepted at the Workshop on Unifying Representations in Neural Models (UniReps) at NeurIPS 2025.
Activation steering methods in large language models (LLMs) have emerged as an effective way to perform targeted updates to enhance generated language without requiring large amounts of adaptation data. We ask whether the features discovered by activation steering …