
Using Informative Data Subsets for Efficient Training of Large Language Models: An Initial Study
H S V N S Kowndinya Renduchintala · Krishnateja Killamsetty · Sumit Bhatia · Milan Aggarwal · Ganesh Ramakrishnan · Rishabh Iyer

Language Models (LMs) are pretrained on large unlabeled corpora through self-supervision tasks and have become ubiquitous in many NLP applications. Recent trends indicate that the generalization capability of Large LMs (LLMs) improves tremendously with increasing model capacity and size of the pretraining dataset. However, this also results in inefficiencies owing to higher training times, compute requirements, and environmental impact. Previous work has mostly addressed these inefficiencies by improving sample efficiency, architecture, and the training loss objective, with little focus on data optimization. In this work, we explore whether it is possible to train LLMs on only highly informative subsets of the training data while maintaining their performance. We build upon prior work on informative data subset selection and propose INGENIOUS, a framework that selects highly representative subsets of the training corpus by optimizing a submodular function. We show that INGENIOUS can be adopted for the scale of LLM training and empirically demonstrate that the proposed framework achieves ~99% of original BERT performance in about 35% of the original training time.
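To make the idea of selecting a representative subset by optimizing a submodular function concrete, here is a minimal sketch of greedy maximization. The abstract does not say which submodular function or which features INGENIOUS uses, so the facility-location objective, the toy similarity matrix, and the function names below are all assumptions chosen for illustration, not the authors' implementation.

```python
def facility_location_gain(similarity, best_cover, candidate):
    """Marginal gain in f(S) = sum_i max_{j in S} similarity[i][j] from adding `candidate`."""
    return sum(
        max(row[candidate] - cover, 0.0)
        for row, cover in zip(similarity, best_cover)
    )


def greedy_facility_location(similarity, budget):
    """Greedily pick up to `budget` indices that maximize the facility-location value.

    `similarity` is a square list of lists of non-negative pairwise similarities
    (hypothetical input; a real pipeline would compute these from sample features).
    """
    n = len(similarity)
    selected = []
    # best_cover[i] = similarity of sample i to its closest already-selected sample.
    best_cover = [0.0] * n
    for _ in range(min(budget, n)):
        remaining = [j for j in range(n) if j not in selected]
        best = max(
            remaining,
            key=lambda j: facility_location_gain(similarity, best_cover, j),
        )
        selected.append(best)
        best_cover = [max(cover, row[best]) for row, cover in zip(similarity, best_cover)]
    return selected


# Toy example: samples 0 and 1 are near-duplicates, as are samples 2 and 3.
toy_similarity = [
    [1.0, 0.9, 0.1, 0.0],
    [0.9, 1.0, 0.2, 0.1],
    [0.1, 0.2, 1.0, 0.8],
    [0.0, 0.1, 0.8, 1.0],
]
print(greedy_facility_location(toy_similarity, budget=2))
```

With a budget of two, the greedy procedure picks one sample from each near-duplicate pair rather than two samples from the same pair, which is the sense in which such a subset is representative. This naive loop recomputes every marginal gain at every step; at the scale of a pretraining corpus some cheaper approximation would be needed.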

Author Information

H S V N S Kowndinya Renduchintala (Indian Institute of Technology Bombay)
Krishnateja Killamsetty (University of Texas, Dallas)
Sumit Bhatia (MDSR Lab, Adobe Systems)
Milan Aggarwal (IIT Delhi)
Ganesh Ramakrishnan (Indian Institute of Technology Bombay)
Rishabh Iyer (University of Texas, Dallas)

Bio: Prof. Rishabh Iyer is currently an Assistant Professor at the University of Texas, Dallas, where he leads the CARAML Lab. He is also a Visiting Assistant Professor at the Indian Institute of Technology, Bombay. He completed his Ph.D. in 2015 at the University of Washington, Seattle. He is interested in making ML more efficient (in both computation and labeling), robust, and fair. He received best paper awards at Neural Information Processing Systems (NeurIPS/NIPS) in 2013 and the International Conference on Machine Learning (ICML) in 2013, and an Honorable Mention at CODS-COMAD in 2021. He has also won a Microsoft Research Ph.D. Fellowship, a Facebook Ph.D. Fellowship, and the Yang Award for Outstanding Graduate Student from the University of Washington.

More from the Same Authors