
Lightning Talk
Workshop: Data Centric AI

Small Data in NLU: Proposals towards a Data-Centric Approach


Domain-specific voice assistants often suffer from data scarcity. Publicly available, annotated datasets are in short supply and rarely fit the domain and language required by a specific use case. Insufficient attention to data quality can undermine both training and evaluation. The Computational Linguistics (CL) community has gained expertise and developed best practices for high-quality data collection and annotation as well as for qualitative data analysis. However, the recent model-centric focus in AI and ML has not created ideal conditions for fruitful collaboration with CL and the more data-centric fields of NLP to tackle data quality issues. We showcase principles and methods from CL / NLP research that can potentially guide the development of data-centric NLU for domain-specific voice assistants but have typically been overlooked by common practices in ML / AI. These principles may also help shape data-centric practices in other domains. We argue that paying more attention to data quality and domain specificity can go a long way in improving the NLU components of today’s voice assistants.