Poster
Multimodal and Multilingual Embeddings for Large-Scale Speech Mining
Paul-Ambroise Duquenne · Hongyu Gong · Holger Schwenk

Wed Dec 08 12:30 AM -- 02:00 AM (PST)

We present an approach to encode a speech signal into a fixed-size representation that minimizes the cosine loss with the existing massively multilingual LASER text embedding space. Semantically equivalent sentences are close in this embedding space, independent of their language and modality (text or audio). Using a similarity metric in that multimodal embedding space, we mine audio in German, French, Spanish and English from Librivox against billions of sentences from Common Crawl. This yields more than twenty thousand hours of aligned speech translations. To evaluate the automatically mined speech/text corpora, we train neural speech translation systems for several language pairs. Adding the mined data achieves significant BLEU improvements on the CoVoST2 and MuST-C test sets compared to a very competitive baseline. Our approach can also be used to directly perform speech-to-speech mining, without the need to first transcribe or translate the data. We obtain more than one thousand three hundred hours of aligned speech in French, German, Spanish and English. This speech corpus has the potential to boost research in speech-to-speech translation, which suffers from a scarcity of natural end-to-end training data. All the mined multimodal corpora will be made freely available.
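Below is a minimal, hypothetical PyTorch sketch of the teacher-student idea the abstract describes: a speech encoder is trained so that its fixed-size output minimizes a cosine loss against frozen LASER text embeddings of the transcripts, after which cosine similarity in the shared space can score candidate speech/text pairs for mining. `AudioEncoder`, the random stand-in data, and the 0.9 threshold are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class AudioEncoder(nn.Module):
    """Toy speech encoder: projects a feature sequence and mean-pools it
    into a single fixed-size vector (stand-in for the paper's encoder)."""

    def __init__(self, feat_dim: int = 80, embed_dim: int = 1024):
        super().__init__()
        self.proj = nn.Linear(feat_dim, embed_dim)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, time, feat_dim) -> (batch, embed_dim)
        return self.proj(feats).mean(dim=1)


def cosine_loss(speech_emb: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
    """1 - cos(speech, text), averaged over the batch. text_emb is the
    frozen LASER embedding of the transcript, acting as the teacher."""
    return (1.0 - F.cosine_similarity(speech_emb, text_emb, dim=-1)).mean()


encoder = AudioEncoder()
optimizer = torch.optim.Adam(encoder.parameters(), lr=1e-4)

# Stand-in batch: 8 utterances of 200 frames of 80-dim features, paired
# with precomputed (frozen) 1024-dim LASER embeddings of their transcripts.
feats = torch.randn(8, 200, 80)
laser_targets = torch.randn(8, 1024)

# One training step pulling speech embeddings toward the LASER text space.
optimizer.zero_grad()
loss = cosine_loss(encoder(feats), laser_targets)
loss.backward()
optimizer.step()

# Mining sketch: once trained, score speech/text candidates by cosine
# similarity in the shared space and keep high-scoring pairs. The paper
# mines with a similarity metric in this space; a plain cosine threshold
# stands in for it here.
with torch.no_grad():
    speech_emb = F.normalize(encoder(feats), dim=-1)
    text_emb = F.normalize(laser_targets, dim=-1)
    scores = speech_emb @ text_emb.T            # (n_speech, n_text)
    best_match = scores.argmax(dim=1)           # best text per utterance
    keep = scores.max(dim=1).values > 0.9       # hypothetical threshold
```

The same scoring works for speech-to-speech mining: embed both sides with the speech encoder and compare them directly in the shared space, with no intermediate transcription or translation step.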

Author Information

Paul-Ambroise Duquenne (Facebook)
Hongyu Gong (Facebook AI Research)

Hongyu is a research scientist at Facebook AI Research with a focus on speech and text translation. Her research interests span the areas of language representation learning and language generation. She obtained her PhD from the University of Illinois at Urbana-Champaign in 2020.

Holger Schwenk (Le Mans University)
