In launching the Google @home product, we faced the problem of far-field speech recognition. In that setting, speech is reverberant and noisy, which degrades recognition performance. A common approach to mitigating these effects is multi-channel processing. This processing is generally treated as an "enhancement" step prior to ASR and is developed and optimized as a separate component of the overall system. In our work, we integrated this component into the neural network that performs the speech recognition classification task, which allows joint optimization of the enhancement and recognition components. Because the structure of the network's input layer is based on the "classical" structure of the enhancement component, we can interpret what type of representation the network learned. We will show that in some cases this learned representation appears to mimic what was discovered by previous research, while in other cases it seems "esoteric".
The second part of this talk will focus on an end-to-end letter-to-sound model for Japanese. Japanese uses a complex orthography in which the pronunciation of the Chinese characters that form part of the script varies with context. The fact that Japanese (like Chinese and Korean) does not explicitly mark word boundaries in the orthography further complicates this mapping. We show results for an end-to-end encoder/decoder model structure that learns the letter-to-sound relationship. These models are trained from speech data coming through our systems. The results show that such models are capable of learning the mapping, with accuracies exceeding 90% for a number of model topologies. Observing the learned representations and attention distributions for various architectures provides some insight into which cues the model uses to learn the relationship. It also shows that interpretation remains limited, since the joint optimization of the encoder and decoder components gives the model the freedom to learn implicit representations that are not directly amenable to interpretation.
Michiel Bacchiani (Google Inc.)