Oral
Efficient Exact Gradient Update for training Deep Networks with Very Large Sparse Targets
Pascal Vincent · Alexandre de Brébisson · Xavier Bouthillier
An important class of problems involves training deep neural networks with sparse prediction targets of very high dimension D. These occur naturally in, e.g., neural language models or the learning of word embeddings, often posed as predicting the probability of next words among a vocabulary of size D (e.g. 200,000). Computing the equally large, but typically non-sparse, D-dimensional output vector from a last hidden layer of reasonable dimension d (e.g. 500) incurs a prohibitive $O(Dd)$ computational cost for each example, as does updating the $D \times d$ output weight matrix and computing the gradient needed for backpropagation to previous layers. While efficient handling of large sparse network inputs is trivial, this case of large sparse targets is not, and has thus far been sidestepped with approximate alternatives such as hierarchical softmax or sampling-based approximations during training. In this work we develop an original algorithmic approach that, for a family of loss functions that includes squared error and spherical softmax, can compute the exact loss, gradient update for the output weights, and gradient for backpropagation, all in $O(d^2)$ per example instead of $O(Dd)$, remarkably without ever computing the D-dimensional output. The proposed algorithm yields a speedup of $\frac{D}{4d}$, i.e. two orders of magnitude for typical sizes, for that critical part of the computations that often dominates the training time in this kind of network architecture.
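For the squared-error member of this loss family, the core identity behind the claim can be illustrated directly: with a cached $d \times d$ matrix $Q = W^\top W$, the loss $\|Wh - y\|^2$ for a K-sparse target $y$ decomposes as $h^\top Q h - 2\, y^\top W h + \|y\|^2$, costing $O(d^2) + O(Kd)$ rather than $O(Dd)$. The minimal NumPy sketch below checks that decomposition numerically; the sizes are illustrative, and computing Q from scratch here is for demonstration only (the paper maintains Q and the output weights implicitly across gradient updates rather than recomputing them):

```python
import numpy as np

rng = np.random.default_rng(0)
D, d, K = 2000, 50, 3                    # vocabulary size, hidden size, target sparsity
W = rng.standard_normal((D, d)) * 0.01   # output weight matrix (D x d)
h = rng.standard_normal(d)               # last hidden layer activation

# K-sparse target: only K of the D entries are nonzero
idx = rng.choice(D, size=K, replace=False)
vals = rng.standard_normal(K)

# Naive loss: materializes the full D-dimensional output, O(Dd)
y = np.zeros(D)
y[idx] = vals
naive = 0.5 * np.sum((W @ h - y) ** 2)

# Exact loss without ever forming the D-dimensional output:
# ||Wh - y||^2 = h^T Q h - 2 y^T (W h) + ||y||^2, where the sparse term
# y^T (W h) touches only the K rows of W indexed by the target.
Q = W.T @ W                              # cached/maintained in practice, O(d^2) to use
fast = 0.5 * (h @ Q @ h - 2.0 * vals @ (W[idx] @ h) + vals @ vals)

assert np.isclose(naive, fast)
```

With Q cached, the `h @ Q @ h` term is $O(d^2)$ and the sparse correction is $O(Kd)$, independent of D; the harder part of the paper is keeping Q and W consistent under gradient updates at the same cost.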
Author Information
Pascal Vincent (U. Montreal)
Alexandre de Brébisson (MILA, University of Montréal)
Xavier Bouthillier (Université de Montréal)
More from the Same Authors
- 2020 Poster: Adversarial Example Games
  Joey Bose · Gauthier Gidel · Hugo Berard · Andre Cianflone · Pascal Vincent · Simon Lacoste-Julien · Will Hamilton
- 2019: Catered Lunch and Poster Viewing (in Workshop Room)
  Gustavo Stolovitzky · Prabhu Pradhan · Pablo Duboue · Zhiwen Tang · Aleksei Natekin · Elizabeth Bondi-Kelly · Xavier Bouthillier · Stephanie Milani · Heimo Müller · Andreas T. Holzinger · Stefan Harrer · Ben Day · Andrey Ustyuzhanin · William Guss · Mahtab Mirmomeni
- 2019 Workshop: Retrospectives: A Venue for Self-Reflection in ML Research
  Ryan Lowe · Yoshua Bengio · Joelle Pineau · Michela Paganini · Jessica Forde · Shagun Sodhani · Abhishek Gupta · Joel Lehman · Peter Henderson · Kanika Madan · Koustuv Sinha · Xavier Bouthillier
- 2019 Poster: MelGAN: Generative Adversarial Networks for Conditional Waveform Synthesis
  Kundan Kumar · Rithesh Kumar · Thibault de Boissiere · Lucas Gestin · Wei Zhen Teoh · Jose Sotelo · Alexandre de Brébisson · Yoshua Bengio · Aaron Courville
- 2018 Poster: Fast Approximate Natural Gradient Descent in a Kronecker Factored Eigenbasis
  Thomas George · César Laurent · Xavier Bouthillier · Nicolas Ballas · Pascal Vincent
- 2017: ObamaNet: Photo-realistic lip-sync from text
  Rithesh Kumar · Jose Sotelo · Kundan Kumar · Alexandre de Brébisson
- 2017 Demonstration: A Deep Reinforcement Learning Chatbot
  Iulian Vlad Serban · Chinnadhurai Sankar · Mathieu Germain · Saizheng Zhang · Zhouhan Lin · Sandeep Subramanian · Taesup Kim · Michael Pieper · Sarath Chandar · Nan Rosemary Ke · Sai Rajeswar Mudumba · Alexandre de Brébisson · Jose Sotelo · Dendi A Suhubdy · Vincent Michalski · Joelle Pineau · Yoshua Bengio
- 2015 Poster: Efficient Exact Gradient Update for training Deep Networks with Very Large Sparse Targets
  Pascal Vincent · Alexandre de Brébisson · Xavier Bouthillier
- 2013 Poster: Generalized Denoising Auto-Encoders as Generative Models
  Yoshua Bengio · Li Yao · Guillaume Alain · Pascal Vincent
- 2011 Oral: The Manifold Tangent Classifier
  Salah Rifai · Yann N Dauphin · Pascal Vincent · Yoshua Bengio · Xavier Muller
- 2011 Poster: The Manifold Tangent Classifier
  Salah Rifai · Yann N Dauphin · Pascal Vincent · Yoshua Bengio · Xavier Muller