

Oral in Workshop: Causal Representation Learning

The Linear Representation Hypothesis in Language Models

Kiho Park · Yo Joong Choe · Victor Veitch

Keywords: [ interpretability ] [ large language model ] [ causal framework for representations ] [ linear representation hypothesis ]


Abstract:

In the context of large language models, the "linear representation hypothesis" is the idea that high-level concepts are represented linearly, as directions in a representation space. If the hypothesis were true, we might hope to interpret model representations by identifying the directions that encode concepts of interest, or to control model behavior by intervening on representations along those directions. In this paper, we formalize the linear representation hypothesis in terms of counterfactual pairs and connect this formalism to other notions of the hypothesis, including measurement (via linear probes) and intervention (control). We then empirically demonstrate the existence of linear concept directions in the LLaMA-2 model and show how the different notions of the hypothesis manifest in modern LLMs.
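To make the measurement and intervention notions concrete, the sketch below estimates a concept direction from counterfactual embedding pairs and uses it both to probe (measurement) and to steer (intervention) a representation. This is an illustrative toy under stated assumptions, not the paper's construction: the mean-difference estimator, the additive intervention, and the random vectors standing in for LLaMA-2 hidden states are all choices made for the example.

```python
import numpy as np

def concept_direction(pairs):
    """Estimate a concept direction as the normalized mean difference
    over counterfactual embedding pairs (x_neg, x_pos) that differ
    only in the target concept (e.g., "king" -> "queen")."""
    diffs = np.stack([pos - neg for neg, pos in pairs])
    d = diffs.mean(axis=0)
    return d / np.linalg.norm(d)

def probe(x, d):
    """Measurement: project a representation onto the concept direction."""
    return float(x @ d)

def intervene(x, d, alpha=1.0):
    """Intervention: shift the representation along the concept direction."""
    return x + alpha * d

# Toy data: random vectors stand in for hidden states; the true concept
# direction is the first coordinate axis, with a little noise per pair.
rng = np.random.default_rng(0)
dim, n_pairs = 16, 8
true_dir = np.zeros(dim)
true_dir[0] = 1.0
pairs = []
for _ in range(n_pairs):
    neg = rng.normal(size=dim)
    pos = neg + 2.0 * true_dir + 0.1 * rng.normal(size=dim)
    pairs.append((neg, pos))

d = concept_direction(pairs)           # recovers approximately true_dir
x = rng.normal(size=dim)
print(probe(x, d))                     # concept score before intervention
print(probe(intervene(x, d, 3.0), d))  # score shifts by ~3 after steering
```

With a real model, `pairs` would be hidden states from prompts differing only in the target concept; the paper's formal treatment of counterfactual pairs is richer than this sketch suggests.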
