We introduce the Globally Normalized Autoregressive Transducer (GNAT) for addressing the label bias problem in streaming speech recognition. Our solution admits a tractable exact computation of the denominator for the sequence-level normalization. Through theoretical and empirical results, we demonstrate that by switching to a globally normalized model, the word error rate gap between streaming and non-streaming speech recognition models can be greatly reduced (by more than 50% on the LibriSpeech dataset). This model is developed in a modular framework that encompasses all the common neural speech recognition models. The modularity of this framework enables controlled comparison of modelling choices and the creation of new models. A JAX implementation of our models has been open-sourced.
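To make the idea of an exactly computable sequence-level denominator concrete, the following is a minimal sketch, not the GNAT model or its released implementation. It assumes a toy setting in which the score of each label depends only on the step and the previous label, and it ignores alignments and blank labels entirely. The `scores` table and all function names are hypothetical. Under that assumption, the sum over all label sequences can be computed with a forward recursion instead of enumerating every sequence.

```python
import itertools
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    if m == -math.inf:
        return -math.inf
    return m + math.log(sum(math.exp(v - m) for v in values))


def sequence_score(scores, labels):
    """Unnormalized log-score of one label sequence.

    scores[t][prev][cur] is a hypothetical score for emitting label `cur`
    at step t when the previous label was `prev` (prev = 0 before step 0).
    """
    total, prev = 0.0, 0
    for t, cur in enumerate(labels):
        total += scores[t][prev][cur]
        prev = cur
    return total


def log_partition(scores):
    """Exact log-denominator via a forward recursion over the previous label."""
    num_labels = len(scores[0])
    # alpha[p] = log-sum of the scores of all prefixes whose last label is p.
    alpha = [0.0 if p == 0 else -math.inf for p in range(num_labels)]
    for step in scores:
        alpha = [
            logsumexp([alpha[p] + step[p][c] for p in range(num_labels)])
            for c in range(num_labels)
        ]
    return logsumexp(alpha)


def log_partition_brute_force(scores):
    """Same quantity by enumerating every label sequence (exponential cost)."""
    num_labels = len(scores[0])
    return logsumexp([
        sequence_score(scores, labels)
        for labels in itertools.product(range(num_labels), repeat=len(scores))
    ])


def global_log_prob(scores, labels):
    """Globally normalized log-probability of one label sequence."""
    return sequence_score(scores, labels) - log_partition(scores)


# Toy scores: 3 steps, 2 labels (illustrative values only).
scores = [
    [[0.5, -1.0], [0.0, 0.0]],
    [[1.0, 0.2], [-0.3, 0.8]],
    [[0.1, 0.4], [0.6, -0.5]],
]
```

In this sketch the forward recursion costs a number of operations proportional to the number of steps times the square of the number of labels, whereas the brute-force sum grows exponentially with the number of steps. Both return the same value, so the probabilities from `global_log_prob` sum to one over all sequences.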
Author Information
Ehsan Variani (Google)
I am a Staff Research Scientist at Google. My research focuses on statistical and machine learning and on information theory, with an emphasis on speech and language recognition.
Ke Wu (Google Inc)
Michael D Riley (Google)
David Rybach (Google)
Matt Shannon (Google)
Cyril Allauzen (Google)
More from the Same Authors
- 2021: FedJAX: Federated learning simulation with JAX
  Jae Hun Ro · Ananda Theertha Suresh · Ke Wu
- 2020 Poster: Learning discrete distributions: user vs item-level privacy
  Yuhan Liu · Ananda Theertha Suresh · Felix Xinnan Yu · Sanjiv Kumar · Michael D Riley
- 2018: Workshop Opening
  Mirco Ravanelli · Dmitriy Serdyuk · Ehsan Variani · Bhuvana Ramabhadran
- 2018 Workshop: Interpretability and Robustness in Audio, Speech, and Language
  Mirco Ravanelli · Dmitriy Serdyuk · Ehsan Variani · Bhuvana Ramabhadran