Oral in Workshop: MATH-AI: The 3rd Workshop on Mathematical Reasoning and AI

Understanding Length Generalization by Thinking Like Transformers

Abstract

Large language models exhibit surprising emergent generalization properties, yet also struggle with many simple reasoning tasks such as arithmetic and parity. In this work, we focus on length generalization, and we propose a unifying framework to understand when and how Transformers can be expected to length generalize on a given task. First, we show that there exist algorithmic tasks for which standard decoder-only Transformers trained from scratch naturally exhibit strong length generalization. For these tasks, we leverage the RASP programming language (Weiss et al., 2021) to show that the correct algorithmic solution can be represented by a simple Transformer. We thus propose and give evidence for the RASP-Generalization Conjecture: Transformers tend to learn a length-generalizing solution if there exists a short RASP-L program that works for all input lengths. We then leverage our insights to give new scratchpad formats which yield strong length generalization on traditionally hard tasks (such as parity and addition). Overall, our work provides a novel perspective on the mechanisms of length generalization and the algorithmic capabilities of Transformers.
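
To make the idea of "a short program that works for all input lengths" concrete, here is a minimal Python sketch in the spirit of RASP's select and aggregate operations. It is an illustration under stated assumptions, not the RASP-L language from the paper: the helper names `select`, `aggregate`, and `prefix_fraction`, the causal masking rule, and the toy task are all hypothetical choices made for this example.

```python
# Illustrative sketch only: a toy imitation of RASP-style primitives.
# All names and the task below are assumptions, not the paper's code.

def select(keys, queries, predicate):
    """Build a causal selector: row q marks the key positions k <= q
    for which predicate(keys[k], queries[q]) holds."""
    n = len(keys)
    return [
        [k <= q and predicate(keys[k], queries[q]) for k in range(n)]
        for q in range(n)
    ]

def aggregate(selector, values):
    """Average the selected values at each query position
    (a stand-in for uniform attention); 0.0 if nothing is selected."""
    out = []
    for row in selector:
        picked = [v for v, on in zip(values, row) if on]
        out.append(sum(picked) / len(picked) if picked else 0.0)
    return out

def prefix_fraction(tokens, target):
    """At each position, the fraction of tokens so far equal to `target`.
    The program text does not depend on the input length."""
    is_target = [1.0 if t == target else 0.0 for t in tokens]
    attend_all = select(tokens, tokens, lambda k, q: True)
    return aggregate(attend_all, is_target)

if __name__ == "__main__":
    print(prefix_fraction([1, 0, 1, 1], 1))
```

The same three-line `prefix_fraction` definition applies unchanged to inputs of any length, which is the property the conjecture associates with solutions that Transformers tend to learn in a length-generalizing way.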
