Poster in Workshop: Interpretable AI: Past, Present and Future

Can sparse autoencoders be used to decompose and interpret steering vectors?

Harry Mayne ⋅ Yushi Yang ⋅ Adam Mahdi

Abstract

Steering vectors are a promising approach to control the behaviour of large language models, demonstrating potential in regulating behaviours such as sycophancy, harmlessness, and refusal. However, their underlying mechanisms are not well understood. This paper investigates whether sparse autoencoders (SAEs) can be used to decompose and interpret steering vectors, in the same way they are commonly used for model activations. We find that applying SAEs to steering vectors results in misleading feature decompositions for two reasons: (1) steering vectors fall outside the distribution for which SAEs are designed, and (2) steering vectors can have meaningful, negative projections in SAE feature directions that SAEs cannot accommodate. We argue that these reasons fundamentally limit the application of SAEs to interpret steering vectors, and discuss potential future directions to overcome these challenges.
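
Reason (2) can be illustrated with a toy example. The following is a minimal sketch, not the authors' code. It assumes a standard ReLU sparse autoencoder, a two-dimensional activation space with orthonormal feature directions, and a steering vector formed as the difference of two activation vectors; all names and values (`sae_encode`, `W_dec`, `act_pos`, `act_neg`) are hypothetical. The point is only that a ReLU encoder clips negative pre-activations to zero, so a negative projection onto a feature direction never appears in the decomposition.

```python
import numpy as np

def sae_encode(x, W_enc, b_enc):
    """Toy SAE encoder: the ReLU clips every negative pre-activation to zero."""
    return np.maximum(0.0, x @ W_enc + b_enc)

def sae_decode(f, W_dec, b_dec):
    """Toy SAE decoder: weighted sum of feature directions (rows of W_dec)."""
    return f @ W_dec + b_dec

# Hypothetical SAE with two orthonormal feature directions (assumption).
W_dec = np.eye(2)          # each row is one feature direction
W_enc = W_dec.T            # tied weights, for simplicity
b_enc = np.zeros(2)
b_dec = np.zeros(2)

# Hypothetical steering vector: difference of two activation vectors (assumption).
act_pos = np.array([1.0, 0.5])
act_neg = np.array([0.2, 1.5])
steering = act_pos - act_neg

# True projections onto the feature directions: the second one is negative.
projections = steering @ W_dec.T

# SAE decomposition: the negative projection is clipped and disappears.
features = sae_encode(steering, W_enc, b_enc)
reconstruction = sae_decode(features, W_dec, b_dec)

print("projections:   ", projections)
print("SAE features:  ", features)
print("reconstruction:", reconstruction)
```

In this sketch the second feature direction carries the larger share of the steering vector's magnitude, yet it is absent from the SAE features, so the reconstruction differs from the original vector.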
