The current trend of scaling language models involves increasing both parameter count and training dataset size. Extrapolating this trend suggests that training dataset size may soon be limited by the amount of text data available on the internet. Motivated by this limit, we investigate scaling language models in data-constrained regimes. Specifically, we run a large set of experiments varying the extent of data repetition and compute budget, ranging up to 900 billion training tokens and 9 billion parameter models. We find that with constrained data for a fixed compute budget, training with up to 4 epochs of repeated data yields negligible changes to loss compared to having unique data. However, with more repetition, the value of adding compute eventually decays to zero. We propose and empirically validate a scaling law for compute optimality that accounts for the decreasing value of repeated tokens and excess parameters. Finally, we experiment with approaches mitigating data scarcity, including augmenting the training dataset with code data or removing commonly used filters. Models and datasets from our 400 training runs are freely available at https://github.com/huggingface/datablations.
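The core idea of the proposed scaling law is that repeated tokens contribute progressively less "effective" data than unique tokens. A minimal sketch of that idea is below; the exponential-decay functional form and the decay constant `r_star` are illustrative assumptions for this sketch, not values quoted on this page:

```python
import math

def effective_data(unique_tokens: float, epochs: float, r_star: float = 15.4) -> float:
    """Effective training data when `unique_tokens` are seen for `epochs` passes.

    Repetitions beyond the first pass contribute with exponentially decaying
    value: D' = U * (1 + r_star * (1 - exp(-R / r_star))), where R = epochs - 1.
    The constant r_star (how fast repetition loses value) is an assumed value.
    """
    repeats = max(epochs - 1.0, 0.0)
    return unique_tokens * (1.0 + r_star * (1.0 - math.exp(-repeats / r_star)))

u = 100e9  # 100B unique tokens
print(effective_data(u, 1) / (1 * u))    # 1.0: a single pass is fully effective
print(effective_data(u, 4) / (4 * u))    # close to 1: ~4 epochs costs little
print(effective_data(u, 40) / (40 * u))  # far below 1: heavy repetition decays
```

Under this form, effective data saturates at `U * (1 + r_star)` no matter how many epochs are run, matching the abstract's observation that with enough repetition the value of adding compute eventually decays to zero.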
Author Information
Niklas Muennighoff (Hugging Face)
Alexander Rush (Cornell University)

Alexander "Sasha" Rush is an Associate Professor at Cornell Tech and a researcher at Hugging Face. His research interest is in the study of language models, with applications in controllable text generation, efficient inference, summarization, and information extraction. In addition to research, he has written several popular open-source software projects supporting NLP research, programming for deep learning, and virtual academic conferences. His projects have received paper and demo awards at major NLP, visualization, and hardware conferences, as well as an NSF CAREER Award and a Sloan Fellowship. He tweets at @srush_nlp.
Boaz Barak (Harvard University)
Teven Le Scao (Mistral)
Nouamane Tazi (Hugging Face)
Aleksandra Piktus (Cohere)
Sampo Pyysalo (University of Turku)
Thomas Wolf (Hugging Face)
Colin Raffel (UNC Chapel Hill and Hugging Face)
Related Events (a corresponding poster, oral, or spotlight)
- 2023 Poster: Scaling Data-Constrained Language Models
  Tue, Dec 12 through Wed, Dec 13 · Great Hall & Hall B1+B2 · Poster #813
More from the Same Authors
- 2021: The impact of domain shift on the calibration of fine-tuned models
  Jay Mohta · Colin Raffel
- 2022: Deconstructing Distributions: A Pointwise Framework of Learning
  Gal Kaplun · Nikhil Ghosh · Saurabh Garg · Boaz Barak · Preetum Nakkiran
- 2022: Models with Conditional Computation Learn Suboptimal Solutions
  Mohammed Muqeeth · Haokun Liu · Colin Raffel
- 2023: Conditional Generation of Antigen Specific T-cell Receptor Sequences
  Dhuvarakesh Karthikeyan · Colin Raffel · Benjamin Vincent · Alex Rubinsteyn
- 2023: Diffusion Models without Attention
  Nathan Yan · Jiatao Gu · Alexander Rush
- 2023: Improving Few-Shot Generalization by Exploring and Exploiting Auxiliary Data
  Alon Albalak · Colin Raffel · William Yang Wang
- 2023: Efficient Online Data Mixing For Language Model Pre-Training
  Alon Albalak · Liangming Pan · Colin Raffel · William Yang Wang
- 2023: OctoPack: Instruction Tuning Code Large Language Models
  Niklas Muennighoff · Qian Liu · Armel Zebaze · Qinkai Zheng · Binyuan Hui · Terry Yue Zhuo · Swayam Singh · Xiangru Tang · Leandro Von Werra · Shayne Longpre
- 2023 Poster: TIES-Merging: Resolving Interference When Merging Models
  Prateek Yadav · Derek Tam · Leshem Choshen · Colin Raffel · Mohit Bansal
- 2023 Poster: OBELICS: An Open Web-Scale Filtered Dataset of Interleaved Image-Text Documents
  Hugo Laurençon · Lucile Saulnier · Leo Tronchon · Stas Bekman · Amanpreet Singh · Anton Lozhkov · Thomas Wang · Siddharth Karamcheti · Alexander Rush · Douwe Kiela · Matthieu Cord · Victor Sanh
- 2023 Poster: Distributed Inference and Fine-tuning of Large Language Models Over The Internet
  Alexander Borzunov · Max Ryabinin · Artem Chumachenko · Dmitry Baranchuk · Tim Dettmers · Younes Belkada · Pavel Samygin · Colin Raffel
- 2023 Invited Talk: Beyond Scaling
  Alexander Rush · Aakanksha Chowdhery · Angela Fan · Percy Liang · Jie Tang
- 2023 Poster: Improving Few-Shot Generalization by Exploring and Exploiting Auxiliary Data
  Alon Albalak · Colin Raffel · William Yang Wang
- 2022: BigScience: A Case Study in the Social Construction of a Multilingual Large Language Model
  Christopher Akiki · Giada Pistilli · Margot Mieskes · Matthias Gallé · Thomas Wolf · Suzana Ilic · Yacine Jernite
- 2022: Petals: Collaborative Inference and Fine-tuning of Large Models
  Alexander Borzunov · Dmitry Baranchuk · Tim Dettmers · Max Ryabinin · Younes Belkada · Artem Chumachenko · Pavel Samygin · Colin Raffel
- 2022 Workshop: Transfer Learning for Natural Language Processing
  Alon Albalak · Colin Raffel · Chunting Zhou · Deepak Ramachandran · Xuezhe Ma · Sebastian Ruder
- 2022 Poster: Compositional Generalization in Unsupervised Compositional Representation Learning: A Study on Disentanglement and Emergent Language
  Zhenlin Xu · Marc Niethammer · Colin Raffel
- 2022 Poster: A Combinatorial Perspective on the Optimization of Shallow ReLU Networks
  Michael S Matena · Colin Raffel
- 2022 Poster: Hidden Progress in Deep Learning: SGD Learns Parities Near the Computational Limit
  Boaz Barak · Benjamin Edelman · Surbhi Goel · Sham Kakade · Eran Malach · Cyril Zhang
- 2022 Poster: Merging Models with Fisher-Weighted Averaging
  Michael S Matena · Colin Raffel
- 2022 Poster: Few-Shot Parameter-Efficient Fine-Tuning is Better and Cheaper than In-Context Learning
  Haokun Liu · Derek Tam · Mohammed Muqeeth · Jay Mohta · Tenghao Huang · Mohit Bansal · Colin Raffel
- 2022 Poster: The BigScience ROOTS Corpus: A 1.6TB Composite Multilingual Dataset
  Hugo Laurençon · Lucile Saulnier · Thomas Wang · Christopher Akiki · Albert Villanova del Moral · Teven Le Scao · Leandro Von Werra · Chenghao Mou · Eduardo González Ponferrada · Huu Nguyen · Jörg Frohberg · Mario Šaško · Quentin Lhoest · Angelina McMillan-Major · Gerard Dupont · Stella Biderman · Anna Rogers · Loubna Ben allal · Francesco De Toni · Giada Pistilli · Olivier Nguyen · Somaieh Nikpoor · Maraim Masoud · Pierre Colombo · Javier de la Rosa · Paulo Villegas · Tristan Thrush · Shayne Longpre · Sebastian Nagel · Leon Weber · Manuel Muñoz · Jian Zhu · Daniel Van Strien · Zaid Alyafeai · Khalid Almubarak · Minh Chien Vu · Itziar Gonzalez-Dios · Aitor Soroa · Kyle Lo · Manan Dey · Pedro Ortiz Suarez · Aaron Gokaslan · Shamik Bose · David Adelani · Long Phan · Hieu Tran · Ian Yu · Suhas Pai · Jenny Chim · Violette Lepercq · Suzana Ilic · Margaret Mitchell · Sasha Luccioni · Yacine Jernite
- 2021 Poster: Revisiting Model Stitching to Compare Neural Representations
  Yamini Bansal · Preetum Nakkiran · Boaz Barak
- 2021 Poster: Training Neural Networks with Fixed Sparse Masks
  Yi-Lin Sung · Varun Nair · Colin Raffel
- 2020: Responsible publication: NLP case study
  Miles Brundage · Bryan McCann · Colin Raffel · Natalie Schulter · Zeerak Waseem · Rosie Campbell
- 2020 Poster: Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks
  Patrick Lewis · Ethan Perez · Aleksandra Piktus · Fabio Petroni · Vladimir Karpukhin · Naman Goyal · Heinrich Küttler · Mike Lewis · Wen-tau Yih · Tim Rocktäschel · Sebastian Riedel · Douwe Kiela
- 2019 Poster: SGD on Neural Networks Learns Functions of Increasing Complexity
  Dimitris Kalimeris · Gal Kaplun · Preetum Nakkiran · Benjamin Edelman · Tristan Yang · Boaz Barak · Haofeng Zhang
- 2019 Spotlight: SGD on Neural Networks Learns Functions of Increasing Complexity
  Dimitris Kalimeris · Gal Kaplun · Preetum Nakkiran · Benjamin Edelman · Tristan Yang · Boaz Barak · Haofeng Zhang
- 2019 Poster: (Nearly) Efficient Algorithms for the Graph Matching Problem on Correlated Random Graphs
  Boaz Barak · Chi-Ning Chou · Zhixian Lei · Tselil Schramm · Yueqi Sheng
- 2018: The Conversational Intelligence Challenge 2 (ConvAI2): Winners talks & spotlights
  Thomas Wolf · Xuezheng Peng · Christian Saam · Henry Elder · Rauf Kurbanov · Mohammad Shadab Alam