Skip to yearly menu bar Skip to main content


Keynote Talk
in
Workshop: Efficient Natural Language and Speech Processing (Models, Training, and Inference)

NLP with Synthetic Text

Mohammad Norouzi


Abstract:

Synthetic data is successfully used to train powerful machine learning models for computer vision and robotics, thanks to the availability of high-fidelity graphics and physics-based simulation. But, can synthetic data be successfully used to improve natural language processing? In this talk, I will advocate for the use of large language models as a great source of synthetic text. I will review recent work on data augmentation for NLP and describe a general framework for NLP with synthetic text, called “Generate, Annotate, and Learn”. I will highlight a few key results on generating unlabeled text for improving semi-supervised learning and knowledge distillation, in addition to advancing GPT3-style few-shot learning.