In recent years, rapid progress has been made on the problem of single-channel sound separation using supervised training of deep neural networks. In such supervised approaches, a model is trained to predict the component sources from synthetic mixtures created by adding up isolated ground-truth sources. Reliance on this synthetic training data is problematic because good performance depends upon the degree of match between the training data and real-world audio, especially in terms of the acoustic conditions and distribution of sources. The acoustic properties can be challenging to accurately simulate, and the distribution of sound types may be hard to replicate. In this paper, we propose a completely unsupervised method, mixture invariant training (MixIT), that requires only single-channel acoustic mixtures. In MixIT, training examples are constructed by mixing together existing mixtures, and the model separates them into a variable number of latent sources, such that the separated sources can be remixed to approximate the original mixtures. We show that MixIT can achieve competitive performance compared to supervised methods on speech separation. Using MixIT in a semi-supervised learning setting enables unsupervised domain adaptation and learning from large amounts of real-world data without ground-truth source waveforms. In particular, we significantly improve reverberant speech separation performance by incorporating reverberant mixtures, train a speech enhancement system from noisy mixtures, and improve universal sound separation by incorporating a large amount of in-the-wild data.
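The core idea above — mix two existing mixtures, separate the sum into latent sources, and score how well the sources can be remixed into the original two mixtures — can be sketched as a loss function. The following is a minimal NumPy sketch under stated assumptions: the function names (`snr_loss`, `mixit_loss`), the plain (unthresholded) SNR loss, and the exhaustive search over source-to-mixture assignments are illustrative choices, not the paper's exact implementation.

```python
import itertools
import numpy as np

def snr_loss(ref, est, eps=1e-8):
    """Negative SNR in dB between a reference and an estimate (lower is better).
    A simple stand-in for the loss; the paper uses a thresholded SNR variant."""
    err = ref - est
    return 10 * np.log10(np.sum(err ** 2) + eps) - 10 * np.log10(np.sum(ref ** 2) + eps)

def mixit_loss(mixtures, est_sources):
    """MixIT objective for one mixture-of-mixtures example.

    mixtures:    (2, T) array, the two reference mixtures x1, x2.
    est_sources: (M, T) array, the model's separated sources for x1 + x2.

    Each estimated source is assigned to exactly one of the two mixtures;
    the loss is minimized over all 2**M binary assignment matrices.
    """
    num_sources = est_sources.shape[0]
    best = np.inf
    # Brute-force search over assignments (fine for small M; the paper
    # discusses efficient formulations).
    for assignment in itertools.product(range(2), repeat=num_sources):
        mix_matrix = np.zeros((2, num_sources))
        mix_matrix[list(assignment), range(num_sources)] = 1.0
        remixed = mix_matrix @ est_sources  # (2, T) remixed estimates
        loss = sum(snr_loss(mixtures[i], remixed[i]) for i in range(2))
        best = min(best, loss)
    return best
```

Because the loss only compares remixed estimates against the two input mixtures, no ground-truth isolated sources are needed — which is what lets MixIT train on real-world mixtures directly.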
Scott Wisdom (Google)
Efthymios Tzinis (University of Illinois at Urbana-Champaign)
Efthymios Tzinis is a PhD candidate in the Computer Science (CS) department at the University of Illinois Urbana-Champaign (UIUC), advised by Prof. Paris Smaragdis. He also works part-time as a student researcher with Google AI Perception. His current research focuses on using neural networks for more generalizable and robust audio source separation. He is particularly interested in unsupervised methods that learn directly from mixtures of audio signals, as well as in making source separation neural architectures accessible to everyone. Efthymios also holds a diploma (Bachelor's and MEng equivalent) in Electrical and Computer Engineering (ECE) from the National Technical University of Athens (NTUA). In his thesis, he investigated how nonlinear recurrence information from the phase space of phonemes, combined with manifold learning, could be used for more efficient speech emotion recognition.
Hakan Erdogan (Google)
Ron Weiss (Google)
Kevin Wilson (Google)
John R. Hershey (Google)
Related Events (a corresponding poster, oral, or spotlight)
2020 Spotlight: Unsupervised Sound Separation Using Mixture Invariant Training »
Tue. Dec 8th, 03:00 -- 03:10 AM, Room: Orals & Spotlights: Language/Audio Applications
More from the Same Authors
2020 : Self-Supervised Audio-Visual Separation of On-Screen Sounds from Unlabeled Videos »
2018 Poster: Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis »
Ye Jia · Yu Zhang · Ron Weiss · Quan Wang · Jonathan Shen · Fei Ren · Zhifeng Chen · Patrick Nguyen · Ruoming Pang · Ignacio Lopez Moreno · Yonghui Wu