Poster in Workshop: NeurIPS 2025 Workshop on Socially Responsible and Trustworthy Foundation Models

Does higher interpretability imply better utility? A Pairwise Analysis on Sparse Autoencoders

Xu Wang ⋅ Yan Hu ⋅ Benyou Wang ⋅ Difan Zou


Abstract

Sparse Autoencoders (SAEs) are widely used to steer large language models (LLMs), based on the assumption that their interpretable features naturally enable effective steering of model behavior. Yet a fundamental question remains unanswered: *does higher interpretability indeed imply better steering utility?* To answer this question, we train 90 SAEs across three LLMs (Gemma-2-2B, Qwen-2.5-3B, Gemma-2-9B), spanning five architectures and six sparsity levels. We evaluate their interpretability and steering utility using SAEBench [Karvonen et al., 2025] and AXBENCH [Wu et al., 2025], respectively, and then perform a rank-agreement analysis via Kendall's rank correlation coefficient $\tau_b$. Within this framework, our analysis reveals only a relatively weak positive association ($\tau_b \approx 0.298$), indicating that interpretability is an insufficient proxy for steering performance. We conjecture that the interpretability–utility gap may stem from the selection of SAE features, as not all features are equally effective for steering. To identify features that truly steer LLM behavior, we propose a novel selection criterion, *$\Delta$ Token Confidence*, which measures how much amplifying a feature changes the next-token distribution. We show that our method improves the steering performance of the three LLMs by **52.52%** compared to the current best output score-based criterion [Arad et al., 2025]. Strikingly, after selecting features with high *$\Delta$ Token Confidence*, the correlation between interpretability and utility vanishes ($\tau_b \approx 0$) and can even become negative. This further highlights the divergence between interpretability and utility for the most effective steering features.
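
For readers unfamiliar with the rank-agreement statistic used above, the following is a minimal sketch of Kendall's $\tau_b$ in pure Python. It uses the standard tie-corrected form $\tau_b = (C - D)/\sqrt{(C + D + T_x)(C + D + T_y)}$, where $C$ and $D$ count concordant and discordant pairs and $T_x$, $T_y$ count pairs tied only in the first or only in the second variable. The function name `kendall_tau_b` and the toy scores are hypothetical illustrations, not the authors' code or data.

```python
from math import sqrt


def kendall_tau_b(x, y):
    """Kendall's tau-b between two equal-length score lists (O(n^2) pair scan)."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")

    concordant = discordant = ties_x_only = ties_y_only = 0
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            dx = x[i] - x[j]
            dy = y[i] - y[j]
            if dx == 0 and dy == 0:
                continue  # tied in both variables: contributes to neither term
            elif dx == 0:
                ties_x_only += 1
            elif dy == 0:
                ties_y_only += 1
            elif dx * dy > 0:
                concordant += 1  # the pair is ordered the same way in x and y
            else:
                discordant += 1  # the pair is ordered oppositely in x and y

    denom = sqrt(
        (concordant + discordant + ties_x_only)
        * (concordant + discordant + ties_y_only)
    )
    if denom == 0:
        return float("nan")  # undefined when one variable is constant
    return (concordant - discordant) / denom


# Hypothetical toy scores for four SAEs (illustration only, not paper data).
interpretability = [1, 2, 2, 3]
steering_utility = [1, 3, 2, 4]
tau = kendall_tau_b(interpretability, steering_utility)
```

A value near 1 means the two rankings agree on almost every pair, a value near 0 means that knowing one ranking says little about the other, and negative values indicate reversed orderings. For larger collections, a library routine such as `scipy.stats.kendalltau` is the more practical choice.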
