Poster
FRHOzen: Loss-Based Data Selection for Targeted Language Model Pre-training
David Brandfonbrener · Hanlin Zhang · Andreas Kirsch · Jonathan Richard Schwarz · Sham Kakade
West Ballroom A-D #6706
Selecting high-quality data for pre-training is crucial in shaping the downstream task performance of language models. A major challenge lies in identifying this optimal subset, a problem generally considered intractable, thus necessitating scalable and effective heuristics. In this work, we propose a novel data selection method, FRHOzen (Frozen Reducible Hold Out Loss), which leverages an empirical Bayes-inspired approach to derive a simple and computationally efficient selection criterion based on the relative loss values of two auxiliary models. In addition to the modeling rationale, we provide an empirical evaluation of FRHOzen on two language modeling tasks: (1) selecting data from C4 for domain adaptation evaluated on Books and (2) selecting data from C4 for a suite of downstream multiple-choice question answering tasks. We find that FRHOzen-selected data consistently outperforms training on 8x as much randomly selected data. Furthermore, we demonstrate that data selection transfers across scales: selections made with 150 million parameter auxiliary models lead to gains over 8x as much randomly selected data when training a 1.2 billion parameter model.
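A minimal sketch of a loss-difference selection criterion of the kind described above, assuming each candidate sequence is scored by the gap between the loss of a base auxiliary model and that of an auxiliary model adapted to hold-out (target) data, with the top-scoring fraction kept. The function, variable names, and Hugging Face-style causal LM interface are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch: score candidates by the loss reduction between two frozen
# auxiliary models and keep the highest-scoring fraction. Not the paper's code.
import torch


@torch.no_grad()
def select_top_fraction(candidates, base_model, holdout_model, tokenizer, keep_frac=0.125):
    """Keep the fraction of candidate texts whose loss drops most when moving
    from the base auxiliary model to the hold-out-adapted auxiliary model."""
    scores = []
    for text in candidates:
        ids = tokenizer(text, return_tensors="pt").input_ids
        # Per-sequence language-modeling losses from the two frozen auxiliary models
        # (assumes a Hugging Face-style causal LM that returns .loss given labels).
        base_loss = base_model(ids, labels=ids).loss.item()
        holdout_loss = holdout_model(ids, labels=ids).loss.item()
        # Large positive score: the target-adapted model explains this sequence
        # much better than the base model, suggesting it is target-relevant.
        scores.append(base_loss - holdout_loss)
    k = max(1, int(len(candidates) * keep_frac))
    top_idx = sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)[:k]
    return [candidates[i] for i in top_idx]
```

The `keep_frac=0.125` default is only meant to echo the 8x data-efficiency comparison in the abstract; the actual selection ratio used in the paper may differ.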