

Poster

DataComp-LM: In search of the next generation of training sets for language models

Amro Abbas · Alon Albalak · Kushal Arora · Hritik Bansal · Yonatan Bitton · Yair Carmon · Khyathi Chandu · Mayee Chen · Giannis Daras · Achal Dave · Alex Dimakis · Alaaeldin El-Nouby · Fartash Faghri · Alex Fang · Samir Yitzhak Gadre · Josh Gardner · Saurabh Garg · Dhruba Ghosh · Aaron Gokaslan · Dirk Groeneveld · Etash Guha · Suchin Gururangan · Reinhard Heckel · Cheng-Yu Hsieh · Gabriel Ilharco · Maor Ivgi · Jenia Jitsev · Matt Jordan · Sham Kakade · Sedrick Scott Keh · Maciej Kilian · Pang Wei Koh · Thomas Kollar · Jeffrey Li · Kyle Lo · Kalyani Marathe · Jean Mercat · Niklas Muennighoff · Marianna Nezhurina · Thao Nguyen · Sewoong Oh · Hadi Pouransari · Sarah Pratt · Sunny Sanyal · Ludwig Schmidt · Vaishaal Shankar · Rulin Shao · Georgios Smyrnis · Luca Soldaini · Shuran Song · Alexander Toshev · Igor Vasiljevic · Stephanie Wang · Mitchell Wortsman · Rui Xin · Luke Zettlemoyer · Hanlin Zhang · Jieyu Zhang


Abstract:

We introduce DataComp for Language Models (DCLM), a testbed for controlled dataset experiments with the goal of improving language models. As part of DCLM, we provide a standardized corpus of 240T tokens extracted from Common Crawl, effective pretraining recipes based on the OpenLM framework, and a broad suite of 53 downstream evaluations. Participants in the DCLM benchmark can experiment with data curation strategies such as deduplication, filtering, and data mixing at model scales ranging from 412M to 7B parameters. As a baseline for DCLM, we conduct extensive experiments and find that model-based filtering is key to assembling a high-quality training set. The resulting dataset, DCLM-Baseline, enables training a 7B parameter language model from scratch to 63% 5-shot accuracy on MMLU with 2T training tokens. Compared to MAP-Neo, the previous state of the art in open-data language models, DCLM-Baseline represents a 6 percentage point improvement on MMLU while being trained with half the compute. Our results highlight the importance of dataset design for training language models and offer a starting point for further research on data curation.
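To make the idea of model-based filtering concrete, here is a minimal sketch of the general pattern: score every document with a quality model and keep only the highest-scoring fraction of the corpus. This is NOT the paper's actual pipeline; the `quality_score` heuristic below is a toy stand-in for a trained classifier, and all names here are illustrative.

```python
# Illustrative sketch of model-based filtering: score each document and keep
# only the top-scoring fraction. The scorer is a toy heuristic, standing in
# for a trained quality classifier as used in real curation pipelines.

def quality_score(text: str) -> float:
    """Toy stand-in scorer: rewards longer docs with moderate word lengths.

    A real pipeline would replace this with a trained model's probability
    that the document is high quality.
    """
    words = text.split()
    if not words:
        return 0.0
    avg_len = sum(len(w) for w in words) / len(words)
    length_term = min(len(words) / 10.0, 1.0)          # prefer non-trivial docs
    shape_term = max(0.0, 1.0 - abs(avg_len - 5.0) / 5.0)  # prefer ordinary words
    return length_term * shape_term

def filter_corpus(docs: list[str], keep_fraction: float = 0.1) -> list[str]:
    """Keep the top `keep_fraction` of documents ranked by quality_score."""
    ranked = sorted(docs, key=quality_score, reverse=True)
    k = max(1, int(len(docs) * keep_fraction))
    return ranked[:k]

docs = [
    "short",
    "a reasonably well formed sentence about language models",
    "!!! ???",
    "another coherent paragraph with ordinary words in it",
]
kept = filter_corpus(docs, keep_fraction=0.5)
# The two coherent sentences survive; the fragments are filtered out.
```

The design point is that the filter is a ranking over model scores rather than a hand-written rule, so swapping in a stronger scorer improves the retained corpus without changing the surrounding pipeline.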
