Timezone: »
Sparsely-activated Mixture-of-experts (MoE) models allow the number of parameters to greatly increase while keeping the amount of computation for a given token or a given sample unchanged. However, a poor expert routing strategy (e.g. one resulting in load imbalance) can cause certain experts to be under-trained, leading to an expert being under or over-specialized. Prior work allocates a fixed number of experts to each token using a top-k function regardless of the relative importance of different tokens. To address this, we propose a heterogeneous mixture-of-experts employing an expert choice method. Instead of letting tokens select the top-k experts, we have experts selecting the top-k tokens. As a result, each token can be routed to a variable number of experts and each expert can have a fixed bucket size. We systematically study pre-training speedups using the same computational resources of the Switch Transformer top-1 and GShard top-2 gating of prior work and find that our method improves training convergence time by more than 2×. For the same computational cost, our method demonstrates higher performance in fine-tuning 11 selected tasks in the GLUE and SuperGLUE benchmarks. For a smaller activation cost, our method outperforms the T5 dense model in 7 out of the 11 tasks.
Author Information
Yanqi Zhou (Google Brain)
Tao Lei (MIT)
Hanxiao Liu (Google Brain)
Nan Du (Google Brain)
Yanping Huang (Google Brain)
Vincent Zhao (Google Research, Brain)
Andrew Dai (Google)
zhifeng Chen (Google Brain)
Quoc V Le (Google)
James Laudon (Google)
More from the Same Authors
-
2021 : BEDS-Bench: Behavior of EHR-models under Distributional Shift - A Benchmark »
Anand Avati · Martin Seneviratne · Yuan Xue · Zhen Xu · Balaji Lakshminarayanan · Andrew Dai -
2022 Poster: Chain-of-Thought Prompting Elicits Reasoning in Large Language Models »
Jason Wei · Xuezhi Wang · Dale Schuurmans · Maarten Bosma · brian ichter · Fei Xia · Ed Chi · Quoc V Le · Denny Zhou -
2022 Poster: TabNAS: Rejection Sampling for Neural Architecture Search on Tabular Datasets »
Chengrun Yang · Gabriel Bender · Hanxiao Liu · Pieter-Jan Kindermans · Madeleine Udell · Yifeng Lu · Quoc V Le · Da Huang -
2021 Poster: CoAtNet: Marrying Convolution and Attention for All Data Sizes »
Zihang Dai · Hanxiao Liu · Quoc V Le · Mingxing Tan -
2021 Poster: Searching for Efficient Transformers for Language Modeling »
David So · Wojciech Mańke · Hanxiao Liu · Zihang Dai · Noam Shazeer · Quoc V Le -
2021 Poster: Pay Attention to MLPs »
Hanxiao Liu · Zihang Dai · David So · Quoc V Le -
2020 : Panel Discussion & Closing »
Yejin Choi · Alexei Efros · Chelsea Finn · Kristen Grauman · Quoc V Le · Yann LeCun · Ruslan Salakhutdinov · Eric Xing -
2020 : Panel discussion 2 »
Danielle S Bassett · Yoshua Bengio · Cristina Savin · David Duvenaud · Anna Choromanska · Yanping Huang -
2020 : Introduction: Cristina Savin »
Yanping Huang -
2020 : Introduction: David Duvenaud »
Yanping Huang -
2020 Workshop: Beyond BackPropagation: Novel Ideas for Training Neural Architectures »
Mateusz Malinowski · Grzegorz Swirszcz · Viorica Patraucean · Marco Gori · Yanping Huang · Sindy Löwe · Anna Choromanska -
2020 : Live Intro »
Mateusz Malinowski · Viorica Patraucean · Grzegorz Swirszcz · Sindy Löwe · Anna Choromanska · Marco Gori · Yanping Huang -
2020 Poster: Evolving Normalization-Activation Layers »
Hanxiao Liu · Andy Brock · Karen Simonyan · Quoc V Le -
2020 Poster: Transferable Graph Optimizers for ML Compilers »
Yanqi Zhou · Sudip Roy · Amirali Abdolrashidi · Daniel Wong · Peter Ma · Qiumin Xu · Hanxiao Liu · Phitchaya Phothilimtha · Shen Wang · Anna Goldie · Azalia Mirhoseini · James Laudon -
2020 Spotlight: Evolving Normalization-Activation Layers »
Hanxiao Liu · Andy Brock · Karen Simonyan · Quoc V Le -
2020 Oral: Transferable Graph Optimizers for ML Compilers »
Yanqi Zhou · Sudip Roy · Amirali Abdolrashidi · Daniel Wong · Peter Ma · Qiumin Xu · Hanxiao Liu · Phitchaya Phothilimtha · Shen Wang · Anna Goldie · Azalia Mirhoseini · James Laudon -
2020 Poster: PyGlove: Symbolic Programming for Automated Machine Learning »
Daiyi Peng · Xuanyi Dong · Esteban Real · Mingxing Tan · Yifeng Lu · Gabriel Bender · Hanxiao Liu · Adam Kraft · Chen Liang · Quoc V Le -
2020 Poster: RandAugment: Practical Automated Data Augmentation with a Reduced Search Space »
Ekin Dogus Cubuk · Barret Zoph · Jonathon Shlens · Quoc V Le -
2020 Oral: PyGlove: Symbolic Programming for Automated Machine Learning »
Daiyi Peng · Xuanyi Dong · Esteban Real · Mingxing Tan · Yifeng Lu · Gabriel Bender · Hanxiao Liu · Adam Kraft · Chen Liang · Quoc V Le -
2020 Poster: Just Pick a Sign: Optimizing Deep Multitask Models with Gradient Sign Dropout »
Zhao Chen · Jiquan Ngiam · Yanping Huang · Thang Luong · Henrik Kretzschmar · Yuning Chai · Dragomir Anguelov -
2020 Poster: Rethinking Pre-training and Self-training »
Barret Zoph · Golnaz Ghiasi · Tsung-Yi Lin · Yin Cui · Hanxiao Liu · Ekin Dogus Cubuk · Quoc V Le -
2020 Oral: Rethinking Pre-training and Self-training »
Barret Zoph · Golnaz Ghiasi · Tsung-Yi Lin · Yin Cui · Hanxiao Liu · Ekin Dogus Cubuk · Quoc V Le -
2020 Poster: Unsupervised Data Augmentation for Consistency Training »
Qizhe Xie · Zihang Dai · Eduard Hovy · Thang Luong · Quoc V Le -
2020 Poster: Learning to Select Best Forecast Tasks for Clinical Outcome Prediction »
Yuan Xue · Nan Du · Anne Mottram · Martin Seneviratne · Andrew Dai -
2020 Poster: Funnel-Transformer: Filtering out Sequential Redundancy for Efficient Language Processing »
Zihang Dai · Guokun Lai · Yiming Yang · Quoc V Le -
2019 : Poster Session »
Ayse Cakmak · Yunkai Zhang · Srijith Prabhakarannair Kusumam · Mohamed Osama Ahmed · Xintao Wu · Jayesh Choudhari · David I Inouye · Thomas Taylor · Michel Besserve · Ali Caner Turkmen · Kazi Islam · Antonio Artés · Amrith Setlur · Zhanghua Fu · Zhen Han · Abir De · Nan Du · Pablo Sanchez-Martin -
2019 : Poster Session I »
Shuangjia Zheng · Arnav Kapur · Umar Asif · Eyal Rozenberg · Cyprien Gilet · Oleksii Sidorov · Yogesh Kumar · Tom Van Steenkiste · William Boag · David Ouyang · Paul Jaeger · Sheng Liu · Aparna Balagopalan · Deepta Rajan · Marta Skreta · Nikhil Pattisapu · Jann Goschenhofer · Viraj Prabhu · Di Jin · Laura-Jayne Gardiner · Irene Li · sriram kumar · Qiyuan Hu · Mehul Motani · Justin Lovelace · Usman Roshan · Lucy Lu Wang · Ilya Valmianski · Hyeonwoo Lee · Sunil Mallya · Elias Chaibub Neto · Jonas Kemp · Marie Charpignon · Amber Nigam · Wei-Hung Weng · Sabri Boughorbel · Alexis Bellot · Lovedeep Gondara · Haoran Zhang · Taha Bahadori · John Zech · Rulin Shao · Edward Choi · Laleh Seyyed-Kalantari · Emily Aiken · Ioana Bica · Yiqiu Shen · Kieran Chin-Cheong · Subhrajit Roy · Ioana Baldini · So Yeon Min · Dirk Deschrijver · Pekka Marttinen · Damian Pascual Ortiz · Supriya Nagesh · Niklas Rindtorff · Andriy Mulyar · Katharina Hoebel · Martha Shaka · Pierre Machart · Leon Gatys · Nathan Ng · Matthias Hüser · Devin Taylor · Dennis Barbour · Natalia Martinez · Clara McCreery · Benjamin Eyre · Vivek Natarajan · Ren Yi · Ruibin Ma · Chirag Nagpal · Nan Du · Chufan Gao · Anup Tuladhar · Sam Shleifer · Jason Ren · Pouria Mashouri · Ming Yang Lu · Farideh Bagherzadeh-Khiabani · Olivia Choudhury · Maithra Raghu · Scott Fleming · Mika Jain · GUO YANG · Alena Harley · Stephen Pfohl · Elisabeth Rumetshofer · Alex Fedorov · Saloni Dash · Jacob Pfau · Sabina Tomkins · Colin Targonski · Michael Brudno · Xinyu Li · Yiyang Yu · Nisarg Patel -
2019 Poster: XLNet: Generalized Autoregressive Pretraining for Language Understanding »
Zhilin Yang · Zihang Dai · Yiming Yang · Jaime Carbonell · Russ Salakhutdinov · Quoc V Le -
2019 Oral: XLNet: Generalized Autoregressive Pretraining for Language Understanding »
Zhilin Yang · Zihang Dai · Yiming Yang · Jaime Carbonell · Russ Salakhutdinov · Quoc V Le -
2019 Poster: CondConv: Conditionally Parameterized Convolutions for Efficient Inference »
Brandon Yang · Gabriel Bender · Quoc V Le · Jiquan Ngiam -
2019 Poster: Mixtape: Breaking the Softmax Bottleneck Efficiently »
Zhilin Yang · Thang Luong · Russ Salakhutdinov · Quoc V Le -
2019 Poster: Saccader: Improving Accuracy of Hard Attention Models for Vision »
Gamaleldin Elsayed · Simon Kornblith · Quoc V Le -
2019 Poster: GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism »
Yanping Huang · Youlong Cheng · Ankur Bapna · Orhan Firat · Dehao Chen · Mia Chen · HyoukJoong Lee · Jiquan Ngiam · Quoc V Le · Yonghui Wu · zhifeng Chen -
2019 Poster: High Fidelity Video Prediction with Large Stochastic Recurrent Neural Networks »
Ruben Villegas · Arkanath Pathak · Harini Kannan · Dumitru Erhan · Quoc V Le · Honglak Lee -
2018 Poster: Memory Augmented Policy Optimization for Program Synthesis and Semantic Parsing »
Chen Liang · Mohammad Norouzi · Jonathan Berant · Quoc V Le · Ni Lao -
2018 Spotlight: Memory Augmented Policy Optimization for Program Synthesis and Semantic Parsing »
Chen Liang · Mohammad Norouzi · Jonathan Berant · Quoc V Le · Ni Lao -
2018 Poster: DropBlock: A regularization method for convolutional networks »
Golnaz Ghiasi · Tsung-Yi Lin · Quoc V Le -
2018 Poster: Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis »
Ye Jia · Yu Zhang · Ron Weiss · Quan Wang · Jonathan Shen · Fei Ren · zhifeng Chen · Patrick Nguyen · Ruoming Pang · Ignacio Lopez Moreno · Yonghui Wu -
2018 Poster: Learning Temporal Point Processes via Reinforcement Learning »
Shuang Li · Shuai Xiao · Shixiang Zhu · Nan Du · Yao Xie · Le Song -
2018 Spotlight: Learning Temporal Point Processes via Reinforcement Learning »
Shuang Li · Shuai Xiao · Shixiang Zhu · Nan Du · Yao Xie · Le Song -
2017 Symposium: Metalearning »
Risto Miikkulainen · Quoc V Le · Kenneth Stanley · Chrisantha Fernando -
2017 Poster: Style Transfer from Non-Parallel Text by Cross-Alignment »
Tianxiao Shen · Tao Lei · Regina Barzilay · Tommi Jaakkola -
2017 Spotlight: Style Transfer from Non-parallel Text by Cross-Alignment »
Tianxiao Shen · Tao Lei · Regina Barzilay · Tommi Jaakkola -
2016 Poster: An Online Sequence-to-Sequence Model Using Partial Conditioning »
Navdeep Jaitly · Quoc V Le · Oriol Vinyals · Ilya Sutskever · David Sussillo · Samy Bengio -
2016 Poster: Reward Augmented Maximum Likelihood for Neural Structured Prediction »
Mohammad Norouzi · Samy Bengio · zhifeng Chen · Navdeep Jaitly · Mike Schuster · Yonghui Wu · Dale Schuurmans -
2015 Poster: Semi-supervised Sequence Learning »
Andrew Dai · Quoc V Le -
2014 Poster: Sequence to Sequence Learning with Neural Networks »
Ilya Sutskever · Oriol Vinyals · Quoc V Le -
2014 Oral: Sequence to Sequence Learning with Neural Networks »
Ilya Sutskever · Oriol Vinyals · Quoc V Le -
2013 Workshop: Randomized Methods for Machine Learning »
David Lopez-Paz · Quoc V Le · Alexander Smola