Poster in Workshop: Third Workshop on Efficient Natural Language and Speech Processing (ENLSP-III): Towards the Future of Large Language Models and their Emerging Descendants

ASR Data Selection from Multiple Sources: A Practical Approach on Performance Scaling

Hoang Anh Just · I-Fan Chen · Feiyang Kang · Yuanzhi Zhang · Anit Kumar Sahu · Ruoxi Jia

Abstract

This paper proposes a framework that uses small samples from different Automatic Speech Recognition (ASR) data sources to predict model performance and inform ASR data selection decisions. By combining data distribution distance with a mapping technique inspired by neural scaling laws, our framework estimates model performance for various data mixtures within the disclosed range and extrapolates it to much larger target data sizes. This is the first study to extend this approach to ASR problems. Experiments conducted on the LibriSpeech and TED-LIUM3 datasets confirm the effectiveness of the proposed data selection framework. Compared to a heuristic-based selection baseline, our framework consistently achieves 13% to 17% relative word error rate reductions under fine-tuning data budgets of 40, 50, and 100 hours.
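To make the extrapolation idea concrete, the sketch below fits a plain power law, WER(n) = a * n^(-b), to a few small-scale measurements and projects it to a larger data budget. This is a generic scaling-law illustration, not the authors' framework: it omits the data distribution distance and the mixture mapping entirely. The function names and every number in it are hypothetical, and the pilot points are synthetic values generated from a known curve rather than results from the paper.

```python
import math


def fit_power_law(sizes, errors):
    """Fit error = a * size**(-b) by least squares in log-log space."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(e) for e in errors]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # log(error) = log(a) - b * log(size), so a = exp(intercept), b = -slope
    return math.exp(intercept), -slope


def predict(a, b, size):
    """Evaluate the fitted power law at a given data size."""
    return a * size ** (-b)


# Synthetic pilot points (hours of audio, WER in %), made up for illustration
# and generated from a known curve so the fit can be checked.
true_a, true_b = 30.0, 0.25
pilot_hours = [1, 2, 4, 8]
pilot_wer = [predict(true_a, true_b, h) for h in pilot_hours]

a, b = fit_power_law(pilot_hours, pilot_wer)

# Extrapolate from the small pilot range to a much larger budget.
wer_at_100h = predict(a, b, 100)
print(f"a={a:.3f}, b={b:.3f}, projected WER at 100 h: {wer_at_100h:.2f}%")
```

In practice, measured error rates are noisy and often flatten toward an irreducible floor, so a fit of this form would typically add a constant term and be validated on held-out data sizes before being trusted for selection decisions.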
