
Workshop: Attributing Model Behavior at Scale (ATTRIB)

Tell, Don't Show: Internalized Reasoning influences how LLMs generalize

Alexander Meinke · Owain Evans


We explore how declarative statements in training data influence a language model's generalization. For example, suppose a model is trained on both weather reports up to 2023 and declarative statements about climate change. When prompted to generate weather reports for 2050, will this model incorporate the facts about climate change or simply match the statistics of the previous reports? To investigate this question, we finetune language models on a mix of declarative and non-declarative information and test how the former affects generalization. We find that declarative information has a clear and systematic effect on model predictions, consistent across model families (GPT-3 and Llama-2) and across two domains: predicting weather and demographic features. Through a series of ablations, we show that this effect cannot be explained by simple associative learning (i.e. matching words in the prompt to words in declarative statements).
