Workshop: Attributing Model Behavior at Scale (ATTRIB)

When Less is More: Investigating Data Pruning for Pretraining LLMs at Scale

Max Marion · Ahmet Üstün · Luiza A Pozzobon · Alex Wang · Marzieh Fadaee · Sara Hooker


Large volumes of text data have contributed significantly to the development of large language models (LLMs) in recent years. To date, efforts to prune these datasets to higher quality subsets have relied on hand-crafted heuristics encoded as rule-based filters. In this work, we explore scalable estimates of data quality that can be used to systematically measure the quality of pretraining data, namely perplexity, the Error L2-Norm, and memorization. These metrics are used to rank and prune pretraining corpora, and we subsequently compare LLMs trained on these pruned datasets. We find that perplexity outperforms other scoring methods and improves over our no-pruning baseline while training on as little as 30% of the original training dataset. Our work sets a foundation for strategies in automatically curating high quality corpora and suggests that large amounts of pretraining data can be removed while retaining performance.
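The rank-and-prune procedure the abstract describes can be sketched in a few lines: score every document with a quality metric (here perplexity, the abstract's best-performing metric), sort by score, and keep only a fraction of the corpus. This is an illustrative sketch, not the paper's implementation: the toy log-probabilities, the `keep_fraction` parameter name, and the choice to keep the lowest-perplexity documents are all assumptions made for demonstration.

```python
import math

def perplexity(token_logprobs):
    # Perplexity = exp of the negative mean per-token log-probability.
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def prune_by_perplexity(docs, keep_fraction=0.3):
    # Rank documents by perplexity and keep only keep_fraction of them.
    # Keeping the lowest-perplexity subset is an assumption for this
    # sketch; which end of the ranking to keep is a design choice.
    scored = sorted(docs, key=lambda d: perplexity(d["logprobs"]))
    k = max(1, int(len(scored) * keep_fraction))
    return scored[:k]

# Toy corpus: per-document token log-probs (made up for illustration;
# in practice these would come from a scoring language model).
corpus = [
    {"id": "a", "logprobs": [-0.5, -0.7, -0.6]},
    {"id": "b", "logprobs": [-2.1, -3.0, -2.5]},
    {"id": "c", "logprobs": [-1.0, -1.2, -0.9]},
]
kept = prune_by_perplexity(corpus, keep_fraction=0.34)
```

In a real pipeline the log-probabilities would be produced by a reference language model over each pretraining document, and the pruned subset (e.g. 30% of the corpus, as in the abstract) would then be used for training.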