Automated Curation of Foundation-Scale Pretraining Datasets
Matthew Leavitt
Abstract
Large-scale models are what they eat: the quality and composition of their training data fundamentally determine their performance and behavior. Data curation is a frontier research and engineering challenge in deep learning, but cutting-edge techniques remain confined to large organizations with extensive in-house data teams. We developed a deployable, productionized data curation pipeline that integrates a suite of modular algorithms and scales efficiently to trillions of tokens. We share the results of applying our curation pipeline to generate state-of-the-art text and image-text datasets, demonstrating that scalable, high-quality data curation is accessible beyond the largest AI labs.