Recently, text-to-speech (TTS) models such as FastSpeech and ParaNet have been proposed to generate mel-spectrograms from text in parallel. Despite this advantage, parallel TTS models cannot be trained without guidance from autoregressive TTS models serving as external aligners. In this work, we propose Glow-TTS, a flow-based generative model for parallel TTS that does not require any external aligner. By combining the properties of flows and dynamic programming, the proposed model searches on its own for the most probable monotonic alignment between text and the latent representation of speech. We demonstrate that enforcing hard monotonic alignments enables robust TTS that generalizes to long utterances, and that employing generative flows enables fast, diverse, and controllable speech synthesis. Glow-TTS obtains an order-of-magnitude speed-up in synthesis over the autoregressive model Tacotron 2, with comparable speech quality. We further show that our model can be easily extended to a multi-speaker setting.
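The alignment search described in the abstract can be illustrated with a small dynamic-programming routine: given a matrix of log-likelihoods between text tokens and speech frames, find the monotonic path that maximizes the total log-likelihood. This is only a simplified sketch of the idea, not the authors' implementation; the function name, shapes, and the assumption that every frame maps to exactly one token (advancing by at most one token per frame) are illustrative choices.

```python
import numpy as np

def monotonic_alignment_search(log_p):
    """Most probable monotonic alignment via dynamic programming (sketch).

    log_p: (T_text, T_mel) array, where log_p[i, j] is the log-likelihood
    that speech frame j is generated by text token i.
    Returns a (T_mel,) integer array mapping each frame to a token index.
    """
    T_text, T_mel = log_p.shape
    # Q[i, j]: best cumulative log-likelihood of any monotonic path
    # that assigns frame j to token i.
    Q = np.full((T_text, T_mel), -np.inf)
    Q[0, 0] = log_p[0, 0]
    for j in range(1, T_mel):
        for i in range(min(j + 1, T_text)):  # a path can reach token i only after i frames
            # From frame j-1 to frame j, either stay on token i or advance from i-1.
            stay = Q[i, j - 1]
            advance = Q[i - 1, j - 1] if i > 0 else -np.inf
            Q[i, j] = log_p[i, j] + max(stay, advance)
    # Backtrack from the last token at the last frame.
    alignment = np.zeros(T_mel, dtype=int)
    i = T_text - 1
    for j in range(T_mel - 1, -1, -1):
        alignment[j] = i
        if i > 0 and j > 0 and Q[i - 1, j - 1] >= Q[i, j - 1]:
            i -= 1
    return alignment
```

For example, with two tokens and four frames where the first two frames fit token 0 and the last two fit token 1, the search recovers the alignment `[0, 0, 1, 1]`. The full DP costs O(T_text × T_mel), which is what makes training without an external aligner tractable.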
Author Information
Jaehyeon Kim (Kakao Enterprise)
Sungwon Kim (Seoul National University)
Jungil Kong (Kakao Enterprise)
Sungroh Yoon (Seoul National University)
Dr. Sungroh Yoon is an Associate Professor with the Department of Electrical and Computer Engineering, Seoul National University, South Korea. He received the B.S. degree from Seoul National University, South Korea, and the M.S. and Ph.D. degrees from Stanford University, CA, all in electrical engineering. He held research positions with Stanford University, CA, Intel Corporation, Santa Clara, CA, and Synopsys, Inc., Mountain View, CA, and was an Assistant Professor with the School of Electrical Engineering, Korea University, from 2007 to 2012. Prof. Yoon is the recipient of the 2013 IEEE/IEIE Joint Award for Young IT Engineers. His research interests include deep learning, machine learning, data-driven artificial intelligence, and large-scale applications including biomedicine.
Related Events (a corresponding poster, oral, or spotlight)
- 2020 Oral: Glow-TTS: A Generative Flow for Text-to-Speech via Monotonic Alignment Search (Tue, Dec 8, 02:15--02:30 AM, Orals & Spotlights: Language/Audio Applications)
More from the Same Authors
- 2022: Progressive Deblurring of Diffusion Models for Coarse-to-Fine Image Synthesis (Sangyun Lee · Hyungjin Chung · Jaehyeon Kim · Jong Chul Ye)
- 2022: Sample-efficient Adversarial Imitation Learning (Dahuin Jung · Hyungyu Lee · Sungroh Yoon)
- 2021 Poster: Reducing Information Bottleneck for Weakly Supervised Semantic Segmentation (Jungbeom Lee · Jooyoung Choi · Jisoo Mok · Sungroh Yoon)
- 2020 Poster: NanoFlow: Scalable Normalizing Flows with Sublinear Parameter Complexity (Sang-gil Lee · Sungwon Kim · Sungroh Yoon)
- 2017 Poster: Deep Recurrent Neural Network-Based Identification of Precursor microRNAs (Seunghyun Park · Seonwoo Min · Hyun-Soo Choi · Sungroh Yoon)
- 2016 Poster: Neural Universal Discrete Denoiser (Taesup Moon · Seonwoo Min · Byunghan Lee · Sungroh Yoon)