NIPS 2016


End-to-end Learning for Speech and Audio Processing

John Hershey · Philemon Brakel

Hilton Diag. Mar, Blrm. A

This workshop focuses on recent advances in end-to-end methods for speech and more general audio processing. Deep learning has transformed the state of the art in speech recognition and in audio analysis in general. In recent developments, new deep learning architectures have made it possible to integrate the entire inference process into an end-to-end system. This involves solving problems of an algorithmic nature, such as search over time alignments between different domains and dynamic tracking of changing input conditions. Topics include automatic speech recognition (ASR) systems and other audio processing systems that subsume front-end adaptive microphone array processing and source separation, as well as back-end constructs such as phonetic context dependency, dynamic time alignment, or phoneme-to-grapheme modeling. Other end-to-end audio applications include speaker diarization, source separation, and music transcription. A variety of architectures have been proposed for such systems, ranging from shift-invariant convolutional pooling to connectionist temporal classification (CTC) and attention-based mechanisms, or other novel dynamic components. However, the literature so far offers little comparison of the relative merits of the different approaches. This workshop delves into how different approaches handle various trade-offs in modularity and integration, and in representation and generalization. This is an exciting new area, and we expect significant interest from the machine learning and speech and audio processing communities.



