A versatile and efficient approach to summarize speech into utterance-level representations
Joao Monteiro · Jahangir Alam · Tiago H Falk

Time delay neural networks (TDNNs) have become ubiquitous for voice biometrics and language recognition tasks that rely on utterance-level speaker- or language-dependent representations. In this paper, we discuss directions to improve upon the conventional TDNN architecture to render it more generally applicable. More specifically, we explore the utility of performing pooling operations across different levels of the convolutional stack, and we further propose an approach to efficiently combine the resulting set of representations. We show that the resulting models are more versatile, in the sense that a fixed architecture can be re-used across different tasks, and that the learned representations are more discriminative. Evaluations are performed across two settings: (1) two sub-tasks for spoofing attack detection, and (2) three sub-tasks for spoken language identification. Results show that the proposed design improves over the original TDNN architecture, as well as over other previously proposed methods.
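
To make the idea of pooling at several levels of the stack concrete, the following is a minimal sketch rather than the authors' implementation. It assumes mean and standard-deviation pooling over time at each layer and plain concatenation as the combination step; the abstract does not specify either choice, and the function names `stats_pool` and `multi_level_embedding` are hypothetical.

```python
from statistics import fmean, pstdev


def stats_pool(frames):
    """Mean and standard deviation over time for each feature dimension.

    frames: list of T frames, each a list of D floats. Returns 2 * D floats.
    """
    dims = list(zip(*frames))  # transpose into D sequences of length T
    return [fmean(d) for d in dims] + [pstdev(d) for d in dims]


def multi_level_embedding(layer_outputs):
    """Pool each layer's frame-level output and concatenate the results.

    Concatenation is an assumed stand-in for the paper's combination step.
    """
    pooled = []
    for frames in layer_outputs:
        pooled.extend(stats_pool(frames))
    return pooled


# Toy frame-level outputs from two layers of a hypothetical convolutional stack.
# Layers may differ in both number of frames (T) and feature size (D).
layer_outputs = [
    [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],  # layer 1: T=3, D=2
    [[0.0, 1.0, 2.0], [2.0, 3.0, 4.0]],    # layer 2: T=2, D=3
]
embedding = multi_level_embedding(layer_outputs)
print(len(embedding))  # one fixed-size utterance-level vector
```

Because every layer is reduced to per-dimension statistics, the size of the resulting vector depends only on the layer widths and not on the utterance duration, which is what makes it usable as an utterance-level representation.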

Author Information

Joao Monteiro (Institut National de la Recherche Scientifique)
Jahangir Alam (Computer Research Institute of Montreal (CRIM))
Tiago H Falk (INRS-EMT)
