Language Models (LMs) are pretrained on large unlabeled corpora through self-supervision tasks and have become ubiquitous across NLP applications. Recent trends indicate that the generalization capability of Large LMs (LLMs) improves tremendously with increasing model capacity and pretraining dataset size. However, this also results in inefficiencies: longer training times, higher compute requirements, and greater environmental impact. Previous works have mostly addressed these inefficiency concerns by improving sample efficiency, model architectures, and training objectives, with little focus on optimizing the training data itself. In this work, we explore whether it is possible to train LLMs on only highly informative subsets of the training data while maintaining their performance. We build upon prior work on informative data subset selection and propose INGENIOUS, a framework that selects highly representative subsets of the training corpus by optimizing a submodular function. We show that INGENIOUS can be adopted at the scale of LLM training and empirically demonstrate that the proposed framework achieves ~99% of the original BERT performance in about 35% of the original training time.
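To make the subset-selection idea concrete, below is a minimal sketch of greedy maximization of a facility-location function, a standard monotone submodular objective for choosing a representative subset. This is an illustration under assumptions, not the paper's actual method: the function name `greedy_facility_location`, the toy embeddings, and the budget are hypothetical, and INGENIOUS's real objective, feature representations, and scale differ.

```python
import numpy as np

def greedy_facility_location(similarity: np.ndarray, budget: int) -> list[int]:
    """Greedily maximize the facility-location function
    f(S) = sum_i max_{j in S} sim(i, j), a monotone submodular objective
    that rewards subsets S covering every point with a similar exemplar.
    The greedy rule enjoys the classic (1 - 1/e) approximation guarantee."""
    n = similarity.shape[0]
    selected: list[int] = []
    # max_sim[i] = similarity of point i to its closest selected point so far
    max_sim = np.zeros(n)
    for _ in range(budget):
        # Marginal gain of candidate j: f(S + {j}) - f(S)
        gains = np.maximum(similarity, max_sim[:, None]).sum(axis=0) - max_sim.sum()
        gains[selected] = -np.inf  # never re-pick an already selected point
        best = int(np.argmax(gains))
        selected.append(best)
        max_sim = np.maximum(max_sim, similarity[:, best])
    return selected

# Toy usage: cosine similarities between random "sentence embeddings"
rng = np.random.default_rng(0)
emb = rng.normal(size=(100, 16))
emb /= np.linalg.norm(emb, axis=1, keepdims=True)
subset = greedy_facility_location(emb @ emb.T, budget=10)
```

The design point this sketch captures is that submodularity makes a simple greedy loop near-optimal, so representativeness can be traded against subset size with a single budget parameter; at LLM-corpus scale this naive O(n²) similarity matrix would of course need to be replaced with approximate or partitioned variants.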
Author Information
H S V N S Kowndinya Renduchintala (Indian Institute of Technology Bombay)
Krishnateja Killamsetty (University of Texas, Dallas)
Sumit Bhatia (MDSR Lab, Adobe Systems)
Milan Aggarwal (IIT Delhi)
Ganesh Ramakrishnan (Indian Institute of Technology Bombay)
Rishabh Iyer (University of Texas, Dallas)
Bio: Prof. Rishabh Iyer is currently an Assistant Professor at the University of Texas, Dallas, where he leads the CARAML Lab. He is also a Visiting Assistant Professor at the Indian Institute of Technology, Bombay. He completed his Ph.D. at the University of Washington, Seattle, in 2015. He is excited about making ML more efficient (in both computation and labeling), robust, and fair. He received best paper awards at Neural Information Processing Systems (NeurIPS/NIPS) in 2013 and the International Conference on Machine Learning (ICML) in 2013, and an Honorable Mention at CODS-COMAD in 2021. He has also won a Microsoft Research Ph.D. Fellowship, a Facebook Ph.D. Fellowship, and the Yang Award for Outstanding Graduate Student from the University of Washington.
More from the Same Authors
- 2021 : A Nested Bi-level Optimization Framework for Robust Few Shot Learning »
  Krishnateja Killamsetty · Changbin Li · Chen Zhao · Rishabh Iyer · Feng Chen
- 2021 : Targeted Active Learning using Submodular Mutual Information for Imbalanced Medical Image Classification »
  Suraj Kothawade · Lakshman Tamil · Rishabh Iyer
- 2022 : AutoML for Climate Change: A Call to Action »
  Renbo Tu · Nicholas Roberts · Vishak Prasad C · Sibasis Nayak · Paarth Jain · Frederic Sala · Ganesh Ramakrishnan · Ameet Talwalkar · Willie Neiswanger · Colin White
- 2022 : TALISMAN: Targeted Active Learning for Object Detection with Rare Classes and Slices using Submodular Mutual Information »
  Suraj Kothawade · Saikat Ghosh · Sumit Shekhar · Yu Xiang · Rishabh Iyer
- 2023 Poster: Learning to Select a Subset of Training Examples to Generalize Efficient Model Training »
  Eeshaan Jain · Tushar Nandy · Gaurav Aggarwal · Ashish Tendulkar · Rishabh Iyer · Abir De
- 2023 Poster: When Do Neural Nets Outperform Boosted Trees on Tabular Data? »
  Duncan McElfresh · Sujay Khandagale · Jonathan Valverde · Vishak Prasad C · Ganesh Ramakrishnan · Micah Goldblum · Colin White
- 2022 Poster: ORIENT: Submodular Mutual Information Measures for Data Subset Selection under Distribution Shift »
  Athresh Karanam · Krishnateja Killamsetty · Harsha Kokel · Rishabh Iyer
- 2022 Poster: CyCLIP: Cyclic Contrastive Language-Image Pretraining »
  Shashank Goel · Hritik Bansal · Sumit Bhatia · Ryan Rossi · Vishwa Vinay · Aditya Grover
- 2022 Poster: AUTOMATA: Gradient Based Data Subset Selection for Compute-Efficient Hyper-parameter Tuning »
  Krishnateja Killamsetty · Guttu Sai Abhishek · Aakriti Lnu · Ganesh Ramakrishnan · Alexandre Evfimievski · Lucian Popa · Rishabh Iyer
- 2021 Poster: SIMILAR: Submodular Information Measures Based Active Learning In Realistic Scenarios »
  Suraj Kothawade · Nathan Beck · Krishnateja Killamsetty · Rishabh Iyer
- 2021 Poster: Learning to Select Exogenous Events for Marked Temporal Point Process »
  Ping Zhang · Rishabh Iyer · Ashish Tendulkar · Gaurav Aggarwal · Abir De
- 2021 Poster: RETRIEVE: Coreset Selection for Efficient and Robust Semi-Supervised Learning »
  Krishnateja Killamsetty · Xujiang Zhao · Feng Chen · Rishabh Iyer
- 2015 Poster: Submodular Hamming Metrics »
  Jennifer Gillenwater · Rishabh K Iyer · Bethany Lusch · Rahul Kidambi · Jeffrey A Bilmes
- 2015 Spotlight: Submodular Hamming Metrics »
  Jennifer Gillenwater · Rishabh K Iyer · Bethany Lusch · Rahul Kidambi · Jeffrey A Bilmes
- 2015 Poster: Mixed Robust/Average Submodular Partitioning: Fast Algorithms, Guarantees, and Applications »
  Kai Wei · Rishabh K Iyer · Shengjie Wang · Wenruo Bai · Jeffrey A Bilmes
- 2014 Poster: Learning Mixtures of Submodular Functions for Image Collection Summarization »
  Sebastian Tschiatschek · Rishabh K Iyer · Haochen Wei · Jeffrey A Bilmes
- 2013 Poster: Submodular Optimization with Submodular Cover and Submodular Knapsack Constraints »
  Rishabh K Iyer · Jeffrey A Bilmes
- 2013 Oral: Submodular Optimization with Submodular Cover and Submodular Knapsack Constraints »
  Rishabh K Iyer · Jeffrey A Bilmes
- 2013 Poster: Curvature and Optimal Algorithms for Learning and Minimizing Submodular Functions »
  Rishabh K Iyer · Stefanie Jegelka · Jeffrey A Bilmes
- 2012 Poster: Submodular Bregman Divergences with Applications »
  Rishabh K Iyer · Jeffrey A Bilmes