Speech language models (SpeechLMs) process and generate acoustic data only, without textual supervision. In this work, we propose TWIST, a method for training SpeechLMs using a warm-start from pretrained textual language models. We show, using both automatic and human evaluations, that TWIST outperforms a cold-start SpeechLM across the board. We empirically analyze the effect of different model design choices such as the speech tokenizer, the pretrained textual model, and the dataset size. We find that model and dataset scale both play an important role in constructing better-performing SpeechLMs. Based on our observations, we present the largest (to the best of our knowledge) SpeechLM in terms of both number of parameters and training data. We additionally introduce two spoken versions of the StoryCloze textual benchmark to further improve model evaluation and advance future research in the field. We make speech samples, code, and models publicly available.
Author Information
Michael Hassid (Hebrew University, Meta AI (FAIR))
Tal Remez (Meta)
Tu Anh Nguyen (INRIA, Paris, France)
Itai Gat (Technion)
Alexis CONNEAU (Facebook)
Felix Kreuk (Bar-Ilan University)
Jade Copet (FAIR, Meta)
Alexandre Defossez (Facebook)
Gabriel Synnaeve (Facebook)
Emmanuel Dupoux (Facebook)
Roy Schwartz (The Hebrew University of Jerusalem)
Yossi Adi (The Hebrew University of Jerusalem)
More from the Same Authors
- 2023: Temperature-scaled large language models for Lean proofstep prediction
  Fabian Gloeckle · Baptiste Roziere · Amaury Hayat · Gabriel Synnaeve
- 2023 Poster: Simple and Controllable Music Generation
  Jade Copet · Felix Kreuk · Itai Gat · Tal Remez · Gabriel Synnaeve · Yossi Adi · Alexandre Defossez
- 2023 Poster: Voicebox: Text-Guided Multilingual Universal Speech Generation at Scale
  Matthew Le · Apoorv Vyas · Bowen Shi · Brian Karrer · Leda Sari · Rashel Moritz · Mary Williamson · Vimal Manohar · Yossi Adi · Jay Mahadeokar · Wei-Ning Hsu
- 2023 Poster: From Discrete Tokens to High-Fidelity Audio Using Multi-Band Diffusion
  Robin San Roman · Yossi Adi · Antoine Deleforge · Romain Serizel · Gabriel Synnaeve · Alexandre Defossez
- 2022 Poster: Emergent Communication: Generalization and Overfitting in Lewis Games
  Mathieu Rita · Corentin Tallec · Paul Michel · Jean-Bastien Grill · Olivier Pietquin · Emmanuel Dupoux · Florian Strub
- 2022 Poster: Star Temporal Classification: Sequence Modeling with Partially Labeled Data
  Vineel Pratap · Awni Hannun · Gabriel Synnaeve · Ronan Collobert
- 2022 Poster: On the Importance of Gradient Norm in PAC-Bayesian Bounds
  Itai Gat · Yossi Adi · Alex Schwing · Tamir Hazan
- 2022 Poster: WinoGAViL: Gamified Association Benchmark to Challenge Vision-and-Language Models
  Yonatan Bitton · Nitzan Bitton Guetta · Ron Yosef · Yuval Elovici · Mohit Bansal · Gabriel Stanovsky · Roy Schwartz
- 2021 Poster: Hierarchical Skills for Efficient Exploration
  Jonas Gehring · Gabriel Synnaeve · Andreas Krause · Nicolas Usunier
- 2021: Enhanced Zero-Resource Speech Challenge 2021: Language Modelling from Speech and Images + Q&A
  Ewan Dunbar · Alejandrina Cristia · Okko Räsänen · Bertrand Higy · Marvin Lavechin · Grzegorz Chrupała · Afra Alishahi · Chen Yu · Maureen De Seyssel · Tu Anh Nguyen · Mathieu Bernard · Nicolas Hamilakis · Emmanuel Dupoux
- 2021 Poster: CAPE: Encoding Relative Positions with Continuous Augmented Positional Embeddings
  Tatiana Likhomanenko · Qiantong Xu · Gabriel Synnaeve · Ronan Collobert · Alex Rogozhnikov
- 2021 Poster: XCiT: Cross-Covariance Image Transformers
  Alaaeldin Ali · Hugo Touvron · Mathilde Caron · Piotr Bojanowski · Matthijs Douze · Armand Joulin · Ivan Laptev · Natalia Neverova · Gabriel Synnaeve · Jakob Verbeek · Herve Jegou
- 2021 Poster: Unsupervised Speech Recognition
  Alexei Baevski · Wei-Ning Hsu · Alexis CONNEAU · Michael Auli
- 2021 Oral: Unsupervised Speech Recognition
  Alexei Baevski · Wei-Ning Hsu · Alexis CONNEAU · Michael Auli
- 2021 Poster: Perceptual Score: What Data Modalities Does Your Model Perceive?
  Itai Gat · Idan Schwartz · Alex Schwing
- 2020: The Zero Resource Speech Benchmark 2021. Metrics and baselines for unsupervised spoken language modeling
  Tu Anh Nguyen
- 2020 Poster: Removing Bias in Multi-modal Classifiers: Regularization by Maximizing Functional Entropies
  Itai Gat · Idan Schwartz · Alex Schwing · Tamir Hazan
- 2020 Poster: A causal view of compositional zero-shot recognition
  Yuval Atzmon · Felix Kreuk · Uri Shalit · Gal Chechik
- 2020 Spotlight: A causal view of compositional zero-shot recognition
  Yuval Atzmon · Felix Kreuk · Uri Shalit · Gal Chechik
- 2019 Poster: Cross-lingual Language Model Pretraining
  Alexis CONNEAU · Guillaume Lample
- 2019 Spotlight: Cross-lingual Language Model Pretraining
  Alexis CONNEAU · Guillaume Lample
- 2019 Poster: Anti-efficient encoding in emergent communication
  Rahma Chaabouni · Eugene Kharitonov · Emmanuel Dupoux · Marco Baroni
- 2019 Poster: A Structured Prediction Approach for Generalization in Cooperative Multi-Agent Reinforcement Learning
  Nicolas Carion · Nicolas Usunier · Gabriel Synnaeve · Alessandro Lazaric
- 2019 Spotlight: A Structured Prediction Approach for Generalization in Cooperative Multi-Agent Reinforcement Learning
  Nicolas Carion · Nicolas Usunier · Gabriel Synnaeve · Alessandro Lazaric
- 2018: Accepted papers
  Sven Gowal · Bogdan Kulynych · Marius Mosbach · Nicholas Frosst · Phil Roth · Utku Ozbulak · Simral Chaudhary · Toshiki Shibahara · Salome Viljoen · Nikita Samarin · Briland Hitaj · Rohan Taori · Emanuel Moss · Melody Guan · Lukas Schott · Angus Galloway · Anna Golubeva · Xiaomeng Jin · Felix Kreuk · Akshayvarun Subramanya · Vipin Pillai · Hamed Pirsiavash · Giuseppe Ateniese · Ankita Kalra · Logan Engstrom · Anish Athalye
- 2018 Workshop: Modeling the Physical World: Learning, Perception, and Control
  Jiajun Wu · Kelsey Allen · Kevin Smith · Jessica Hamrick · Emmanuel Dupoux · Marc Toussaint · Josh Tenenbaum
- 2018 Poster: Forward Modeling for Partial Observation Strategy Games - A StarCraft Defogger
  Gabriel Synnaeve · Zeming Lin · Jonas Gehring · Dan Gant · Vegard Mella · Vasil Khalidov · Nicolas Carion · Nicolas Usunier
- 2018 Poster: SING: Symbol-to-Instrument Neural Generator
  Alexandre Defossez · Neil Zeghidour · Nicolas Usunier · Leon Bottou · Francis Bach
- 2018 Poster: Out-of-Distribution Detection using Multiple Semantic Label Representations
  Gabi Shalev · Yossi Adi · Joseph Keshet
- 2017 Poster: Houdini: Fooling Deep Structured Visual and Speech Recognition Models with Adversarial Examples
  Moustapha Cisse · Yossi Adi · Natalia Neverova · Joseph Keshet
- 2016: Datasets, Methodology, and Challenges in Intuitive Physics
  Emmanuel Dupoux · Josh Tenenbaum
- 2016: Naive Physics 101: A Tutorial
  Emmanuel Dupoux · Josh Tenenbaum
- 2016 Workshop: Intuitive Physics
  Adam Lerer · Jiajun Wu · Josh Tenenbaum · Emmanuel Dupoux · Rob Fergus