Decoding Dark Matter: Specialized Sparse Autoencoders for Interpreting Rare Concepts in LLMs
Abstract
Understanding and mitigating the potential risks associated with large language models (LLMs) hinges on developing effective interpretability methods. Sparse Autoencoders (SAEs) have emerged as a promising tool for disentangling LLM representations, but they often struggle to capture rare yet crucial features, especially those relevant to safety. We introduce Specialized Sparse Autoencoders (SSAEs), a novel approach designed to illuminate these elusive ``dark matter'' features by focusing on specific subdomains. We present a practical recipe for training SSAEs, demonstrating the efficacy of dense retrieval for data selection and the benefits of Tilted Empirical Risk Minimization (TERM) as a training objective. We evaluate SSAEs on standard metrics, such as downstream perplexity and $L_0$ sparsity, and find that they capture subdomain tail concepts more effectively than general-purpose SAEs. Furthermore, TERM-trained SSAEs yield more interpretable features, as evidenced by our automated evaluation in which LLMs generate and assess feature explanations. SSAEs, particularly those trained with TERM, provide a powerful new lens for peering into the inner workings of LLMs within specific subdomains and hold significant promise for enhancing AI safety research.
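As context for the TERM objective named above, the standard tilted formulation is sketched below; applying it to a per-example SAE loss $\ell(x_i;\theta)$ with tilt $t>0$ (which upweights high-loss, tail examples relative to plain ERM) is an illustrative assumption here, not necessarily the paper's exact loss or hyperparameter choices.
% TERM sketch: \ell(x_i;\theta) denotes a per-example training loss (assumed
% here to be the SAE reconstruction/sparsity loss); t > 0 is the tilt parameter.
\begin{equation*}
  \tilde{R}(t;\theta)
    = \frac{1}{t}\log\!\left(\frac{1}{N}\sum_{i=1}^{N} e^{\,t\,\ell(x_i;\,\theta)}\right),
  \qquad
  \hat{\theta} = \arg\min_{\theta}\, \tilde{R}(t;\theta).
\end{equation*}
As $t \to 0$ this recovers ordinary empirical risk minimization, while larger $t$ shifts the objective toward the worst-reconstructed (tail) examples.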