There is a growing trend in the machine learning community toward adopting self-supervised approaches to pre-train deep networks. Self-supervised learning uses proxy supervised tasks, for example, distinguishing parts of the input signal from distractors, or generating masked input segments conditioned on the unmasked ones, to derive training signals from unlabeled corpora. These approaches make it possible to use the tremendous amount of unlabeled data on the web to train large networks and solve complicated tasks. ELMo, BERT, and GPT in NLP are famous examples in this direction. Recently, self-supervised approaches for speech and audio processing have also been gaining attention. These approaches combine methods for using few or no labels, unpaired text and audio data, contextual text and video supervision, and signals from user interactions. Although self-supervised learning is an active research direction in speech and audio processing, current work is limited to a few problems, such as automatic speech recognition, speaker identification, and speech translation, partly due to the diversity of modeling approaches across speech and audio processing problems. Much of this research territory remains unexplored.
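To make the proxy tasks above concrete, here is a minimal sketch, in PyTorch, of a contrastive objective of the kind the description mentions: a context representation must pick out the encoding of its true (e.g., masked or future) segment from distractors drawn from the rest of the batch. The function name, tensor shapes, and temperature are illustrative assumptions, not code from any particular published system.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(context, targets, temperature=0.1):
    """InfoNCE-style loss for a batch of (context, true-segment) pairs.

    context: (batch, dim) context vectors that predict a hidden segment.
    targets: (batch, dim) encodings of the true segments; the other rows
             in the batch serve as distractors (negatives).
    """
    context = F.normalize(context, dim=-1)
    targets = F.normalize(targets, dim=-1)
    # Pairwise cosine similarities; the diagonal holds the positive pairs.
    logits = context @ targets.t() / temperature
    labels = torch.arange(context.size(0), device=context.device)
    return F.cross_entropy(logits, labels)
```

Minimizing this loss pulls each context vector toward the encoding of its own segment and pushes it away from the other segments in the batch; masked-prediction objectives instead place a reconstruction or classification loss over the masked positions.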
This workshop will foster focused discussion of self-supervision in speech and audio processing through invited talks, oral and poster sessions featuring high-quality papers, and a panel of leading researchers from academia and industry. Alongside research on new self-supervised methods, data, applications, and results, the workshop calls for novel work on understanding, analyzing, and comparing different self-supervision approaches for speech and audio processing. The workshop aims to:
- Review existing and inspire new self-supervised methods and results,
- Motivate the application of self-supervision approaches to more speech and audio processing problems in academia and industry, and encourage discussion among experts and practitioners from both,
- Encourage work on methods for understanding learned representations, comparing different self-supervision methods, and comparing self-supervision with the self-training and transfer learning methods that low-resource speech and audio processing has long utilized,
- Facilitate communication within the field of speech and audio processing (e.g., researchers who attend conferences such as INTERSPEECH and ICASSP) and between the field and the broader machine learning community, to share knowledge, ideas, and data, and to encourage future collaboration that inspires innovation in the field and the community at large.
Schedule

Fri 6:50 a.m. - 7:00 a.m. | Opening remarks (Introduction) | Hung-yi Lee
Fri 7:00 a.m. - 7:35 a.m. | Invited talk - A Broad Perspective into Self Supervised Learning for Speech Recognition | Bhuvana Ramabhadran
Fri 7:35 a.m. - 7:45 a.m. | Q&A for invited talk - 1
Fri 7:45 a.m. - 8:20 a.m. | Invited talk - Multimodal Distant Supervision | Mark Hasegawa-Johnson
Fri 8:20 a.m. - 8:30 a.m. | Q&A for invited talk - Multimodal Distant Supervision
Fri 8:30 a.m. - 8:40 a.m. | Contributed talk - Self-Supervised Learning using Contrastive Mixtures for Personalized Speech Enhancement | Aswin Sivaraman
Fri 8:40 a.m. - 8:50 a.m. | Contributed talk - Self-supervised Pre-training Reduces Label Permutation Instability of Speech Separation | Sung-Feng Huang
Fri 8:50 a.m. - 9:00 a.m. | Contributed talk - Augmentation adversarial training for self-supervised speaker recognition | Jaesung Huh
Fri 9:00 a.m. - 9:10 a.m. | Contributed talk - Neural Composition: Learning to Generate from Multiple Models | Denis Filimonov
Fri 9:10 a.m. - 9:20 a.m. | Contributed talk - Towards Semi-Supervised Semantics Understanding from Speech | Cheng-I Jeff Lai
Fri 9:20 a.m. - 9:30 a.m. | Contributed talk - The Zero Resource Speech Benchmark 2021. Metrics and baselines for unsupervised spoken language modeling | Tu Anh Nguyen
Fri 9:30 a.m. - 9:45 a.m. | Q&A for contributed talks between 11:30 and 12:30
Fri 9:45 a.m. - 10:00 a.m. | Break
Fri 10:00 a.m. - 10:35 a.m. | Invited talk - Speech Processing with Weak Supervision | Dong Yu
Fri 10:35 a.m. - 10:45 a.m. | Q&A for invited talk - Speech Processing with Weak Supervision
Fri 10:45 a.m. - 10:55 a.m. | Contributed talk - Towards Localisation of Keywords in Speech Using Weak Supervision | Kayode Olaleye
Fri 10:55 a.m. - 11:05 a.m. | Contributed talk - Text-Free Image-to-Speech Synthesis Using Learned Segmental Units | Wei-Ning Hsu
Fri 11:05 a.m. - 11:15 a.m. | Contributed talk - Self-Supervised Audio-Visual Separation of On-Screen Sounds from Unlabeled Videos | Efthymios Tzinis
Fri 11:15 a.m. - 11:25 a.m. | Contributed talk - Multi-Format Contrastive Learning of Audio Representations | Aaron van den Oord
Fri 11:25 a.m. - 11:40 a.m. | Q&A for contributed talks between 1:45 and 2:25
Fri 11:40 a.m. - 11:55 a.m. | Break
Fri 11:55 a.m. - 12:30 p.m. | Invited talk - Underfitting and Uncertainty in Self-Supervised Predictive Models | Chelsea Finn
Fri 12:30 p.m. - 12:40 p.m. | Q&A for invited talk - Underfitting and Uncertainty in Self-Supervised Predictive Models
Fri 12:40 p.m. - 1:15 p.m. | Invited talk - Towards robust self-supervised learning of speech representations | Mirco Ravanelli
Fri 1:15 p.m. - 1:25 p.m. | Q&A for invited talk - Towards robust self-supervised learning of speech representations
Fri 1:25 p.m. - 1:35 p.m. | Contributed talk - Similarity Analysis of Self-Supervised Speech Representations | Yu-An Chung
Fri 1:35 p.m. - 1:45 p.m. | Contributed talk - Representation Learning for Sequence Data with Deep Autoencoding Predictive Components | Junwen Bai
Fri 1:45 p.m. - 1:55 p.m. | Contributed talk - Pushing the Limits of Semi-Supervised Learning for Automatic Speech Recognition | Yu Zhang
Fri 1:55 p.m. - 2:05 p.m. | Contributed talk - A Correspondence Variational Autoencoder for Unsupervised Acoustic Word Embedding | Puyuan Peng
Fri 2:05 p.m. - 2:15 p.m. | Contributed talk - HUBERT: How much can a bad teacher benefit ASR pre-training? | Wei-Ning Hsu
Fri 2:15 p.m. - 2:30 p.m. | Q&A for contributed talks between 4:25 and 5:15
Fri 2:30 p.m. - 2:45 p.m. | Break
Fri 2:45 p.m. - 3:20 p.m. | Invited talk - Flexible contextualized speech representation learning for diverse downstream tasks | Katrin Kirchhoff
Fri 3:20 p.m. - 3:30 p.m. | Q&A for invited talk - Flexible contextualized speech representation learning for diverse downstream tasks
Fri 3:30 p.m. - 4:05 p.m. | Invited talk - De-noising Sequence-to-Sequence Pre-training | Luke Zettlemoyer

Abstract: De-noising auto-encoders can be pre-trained at a very large scale by noising and then reconstructing any input text. Existing methods, based on variations of masked language models, have transformed the field and now provide the de facto initialization to be tuned for nearly every task. In this talk, I will present our work on sequence-to-sequence pre-training that introduces and carefully measures the impact of two new types of noising strategies. I will first describe an approach that allows arbitrary noising, by learning to translate any corrupted text back to the original with standard Transformer-based neural machine translation architectures. I will show that the resulting monolingual (BART) and multilingual (mBART) models provide effective initialization for learning a wide range of discrimination and generation tasks, including question answering, summarization, and machine translation. I will also present our recently introduced MARGE model, where we self-supervise the reconstruction of target text by retrieving a set of related texts (in many languages) and conditioning on them to maximize the likelihood of generating the original. The objective noisily captures aspects of paraphrase, translation, multi-document summarization, and information retrieval, allowing for strong zero-shot performance with no fine-tuning, as well as consistent performance gains when fine-tuned for individual tasks. Together, these techniques provide the most comprehensive set of pre-training methods to date, as well as the first viable alternative to the dominant masked language modeling pre-training paradigm.
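As a toy illustration of the de-noising objective this abstract describes (an illustration only, not the BART/mBART implementation): corrupt the input by replacing a span of tokens with a mask, and train a sequence-to-sequence model to emit the original text. The word-level corruption and mask token below are simplifying assumptions.

```python
import random

MASK = "<mask>"

def corrupt(tokens, max_span=3):
    """Replace one random contiguous span with a single mask token."""
    span = random.randint(1, min(max_span, len(tokens)))
    start = random.randint(0, len(tokens) - span)
    return tokens[:start] + [MASK] + tokens[start + span:]

tokens = "denoising autoencoders reconstruct corrupted input text".split()
source = corrupt(tokens)  # corrupted encoder input
target = tokens           # decoder is trained to reproduce the original
print(" ".join(source), "->", " ".join(target))
```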
Fri 4:05 p.m. - 4:15 p.m. | Q&A for invited talk - De-noising Sequence-to-Sequence Pre-training
Fri 4:15 p.m. - 4:25 p.m. | Closing remarks | Abdelrahman Mohamed
Author Information
Abdelrahman Mohamed (Facebook AI Research (FAIR))
Hung-yi Lee (National Taiwan University)
Shinji Watanabe (Johns Hopkins University)
Shang-Wen Li (Amazon)
Tara Sainath (Google)
Karen Livescu (TTI-Chicago)