Workshop: Second Workshop on Efficient Natural Language and Speech Processing (ENLSP-II)

Using Informative Data Subsets for Efficient Training of Large Language Models: An Initial Study

H S V N S Kowndinya Renduchintala · Krishnateja Killamsetty · Sumit Bhatia · Milan Aggarwal · Ganesh Ramakrishnan · Rishabh Iyer

Keywords: [ ENLSP-Main ]


Language Models (LMs) are pretrained on large unlabeled corpora through self-supervised tasks and have become ubiquitous across NLP applications. Recent trends indicate that the generalization capability of Large LMs (LLMs) improves tremendously with increasing model capacity and pretraining dataset size. However, this scaling also brings inefficiencies: longer training times, higher compute requirements, and greater environmental impact. Previous work has mostly addressed these inefficiencies by improving sample efficiency, architecture, and the training loss objective, with little focus on data optimization. In this work, we explore whether it is possible to train LLMs on only highly informative subsets of the training data while maintaining their performance. We build upon prior work on informative data subset selection and propose INGENIOUS, a framework that selects highly representative subsets of the training corpus by optimizing a submodular function. We show that INGENIOUS can be adopted at the scale of LLM training and empirically demonstrate that the proposed framework achieves ~99% of original BERT performance in about 35% of the original training time.
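
To make the idea of selecting a representative subset by optimizing a submodular function concrete, here is a minimal sketch of greedy maximization of a facility-location objective over sample embeddings. The abstract does not state which submodular function or optimizer INGENIOUS uses, so the facility-location objective, the cosine-similarity kernel, and the name `greedy_facility_location` are all assumptions made for illustration, not the authors' implementation.

```python
import numpy as np

def greedy_facility_location(features, budget):
    """Greedily select `budget` row indices of `features`.

    Illustrative assumption: the objective is facility location,
    f(S) = sum_i max_{j in S} sim[i, j], with a clipped cosine kernel.
    """
    # Cosine similarity between all pairs; negatives are clipped to 0
    # so the objective stays monotone submodular.
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    unit = features / np.clip(norms, 1e-12, None)
    sim = np.clip(unit @ unit.T, 0.0, None)

    n = sim.shape[0]
    best_cover = np.zeros(n)           # max similarity of each point to the chosen set
    chosen = np.zeros(n, dtype=bool)   # marks already-selected points
    selected = []
    for _ in range(min(budget, n)):
        # Marginal gain of adding candidate j to the current set.
        gains = np.maximum(sim, best_cover[:, None]).sum(axis=0) - best_cover.sum()
        gains[chosen] = -np.inf        # never pick the same point twice
        j = int(np.argmax(gains))
        selected.append(j)
        chosen[j] = True
        best_cover = np.maximum(best_cover, sim[:, j])
    return selected
```

This naive version builds a dense n-by-n similarity matrix, which is only feasible for small n; applying such selection to a pretraining corpus would need some form of partitioning or approximation, which this sketch does not attempt.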