
ExpertLens: Activation Steering Features Are Highly Interpretable

This paper was accepted at the Workshop on Unifying Representations in Neural Models (UniReps) at NeurIPS 2025.
Activation steering methods in large language models (LLMs) have emerged as an effective way to perform targeted updates to enhance generated language without requiring large amounts of adaptation data. We ask whether the features discovered by activation steering methods are interpretable. We identify neurons responsible for specific concepts (e.g., “cat”) using the “finding experts” method from research on activation steering and show that the ExpertLens, i.e., inspection of these…
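The "finding experts" idea described above can be illustrated with a toy sketch: rank neurons by how much more they activate on concept-bearing inputs (e.g., sentences about "cat") than on baseline inputs, then steer by boosting only those expert neurons. This is a minimal, hypothetical illustration with synthetic activations, not the paper's implementation; the hidden size, the number of experts `k`, and the `steer` helper are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy hidden activations: rows = examples, cols = neurons (hidden size 8,
# chosen arbitrarily). We plant a signal so neurons 2 and 5 fire more
# strongly on concept-bearing inputs.
concept_acts = rng.normal(0.0, 1.0, size=(100, 8))
concept_acts[:, [2, 5]] += 3.0
baseline_acts = rng.normal(0.0, 1.0, size=(100, 8))

# "Finding experts" (sketch): rank neurons by the mean activation
# difference between concept and baseline inputs, keep the top k.
diff = concept_acts.mean(axis=0) - baseline_acts.mean(axis=0)
k = 2
experts = np.argsort(diff)[-k:]

# Steering (sketch): add the difference vector only on expert neurons.
def steer(hidden, experts, diff, alpha=1.0):
    steered = hidden.copy()
    steered[..., experts] += alpha * diff[experts]
    return steered

h = rng.normal(0.0, 1.0, size=(8,))
print(sorted(experts.tolist()))  # the planted experts, neurons 2 and 5
```

Inspecting which inputs most activate the selected neurons (the "ExpertLens" of the title) is then a matter of sorting examples by their activation on `experts`.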