Poster in Workshop: Attributing Model Behavior at Scale (ATTRIB)

Decoding Dark Matter: Specialized Sparse Autoencoders for Interpreting Rare Concepts in LLMs

Aashiq Muhamed · Jake Mendel · Lucius Bushnaq · Mona Diab · Virginia Smith

Abstract

Understanding and mitigating the potential risks associated with large language models (LLMs) hinges on developing effective interpretability methods. Sparse Autoencoders (SAEs) have emerged as a promising tool for disentangling LLM representations, but they often struggle to capture rare, yet crucial, features, especially those relevant to safety. We introduce Specialized Sparse Autoencoders (SSAEs), a novel approach designed to illuminate these elusive "dark matter" features by focusing on specific subdomains. We present a practical recipe for training SSAEs, demonstrating the efficacy of dense retrieval for data selection and the benefits of Tilted Empirical Risk Minimization (TERM) as a training objective. We evaluate SSAEs on standard metrics, such as downstream perplexity and L0 sparsity, and find that they effectively capture subdomain tail concepts, exceeding the capabilities of general-purpose SAEs. Furthermore, TERM-trained SSAEs yield more interpretable features, as evidenced by our automated evaluation using LLMs to generate and assess feature explanations. SSAEs, particularly those trained with TERM, provide a powerful new lens for peering into the inner workings of LLMs in subdomains and hold significant promise for enhancing AI safety research.
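As a rough illustration of two quantities named in the abstract, the sketch below computes the standard TERM objective, (1/t) · log(mean(exp(t · loss_i))), and an L0 sparsity count. It is not the authors' implementation: the function names `tilted_loss` and `l0_sparsity`, the per-example loss values, and the tilt values are assumptions chosen for illustration only. A positive tilt t weights high-loss (rare) examples more heavily than a plain average does, and t near 0 recovers the ordinary mean.

```python
import math


def tilted_loss(losses, t):
    """Tilted empirical risk: (1/t) * log(mean(exp(t * loss_i))).

    Hypothetical helper for illustration. As t -> 0 this approaches the
    plain mean; as t grows it approaches the maximum per-example loss.
    """
    if not losses:
        raise ValueError("losses must be non-empty")
    if abs(t) < 1e-12:
        return sum(losses) / len(losses)
    # Log-sum-exp trick: subtract the max before exponentiating for stability.
    scaled = [t * loss for loss in losses]
    m = max(scaled)
    log_sum_exp = m + math.log(sum(math.exp(s - m) for s in scaled))
    return (log_sum_exp - math.log(len(losses))) / t


def l0_sparsity(activations, eps=0.0):
    """Average number of latents with |activation| > eps per example."""
    if not activations:
        raise ValueError("activations must be non-empty")
    active = sum(sum(1 for a in row if abs(a) > eps) for row in activations)
    return active / len(activations)


# Made-up per-example reconstruction losses: three common examples, one rare one.
example_losses = [0.1, 0.1, 0.1, 5.0]
mean_loss = tilted_loss(example_losses, 0.0)
tilted = tilted_loss(example_losses, 1.0)

# Made-up latent activations for two examples.
example_l0 = l0_sparsity([[0.0, 1.5, 0.0], [0.2, 0.0, 0.3]])
```

In this toy setting the tilted value lies between the mean and the largest loss, which is the sense in which a positive tilt emphasizes tail examples during training.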
