Sparse mixture of experts provides larger model capacity while keeping computational overhead constant. It employs a routing mechanism that distributes input tokens to the best-matched experts according to their hidden representations. However, learning such a routing mechanism encourages token clustering around expert centroids, implying a trend toward representation collapse. In this work, we propose to estimate the routing scores between tokens and experts on a low-dimensional hypersphere. We conduct extensive experiments on cross-lingual language model pre-training and fine-tuning on downstream tasks. Experimental results across seven multilingual benchmarks show that our method achieves consistent gains. We also present a comprehensive analysis of the representation and routing behaviors of our models. Our method alleviates the representation collapse issue and achieves more consistent routing than the baseline mixture-of-experts methods.
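The core idea of computing routing scores on a low-dimensional hypersphere can be read as: project token representations into a small routing space, L2-normalize both the projected tokens and the expert embeddings, and score them by scaled cosine similarity. Below is a minimal PyTorch sketch of such a router; the class name, dimensions, and temperature initialization are illustrative assumptions rather than the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

class HypersphericalRouter(torch.nn.Module):
    """Sketch of routing on a low-dimensional hypersphere (names are illustrative).

    Token hidden states are projected to a low-dimensional space; both the projected
    tokens and the expert embeddings are L2-normalized, so the routing score is a
    cosine similarity scaled by a (learnable) temperature.
    """

    def __init__(self, hidden_dim: int, num_experts: int, routing_dim: int = 16):
        super().__init__()
        self.proj = torch.nn.Linear(hidden_dim, routing_dim, bias=False)
        self.expert_embed = torch.nn.Parameter(torch.randn(num_experts, routing_dim))
        self.temperature = torch.nn.Parameter(torch.tensor(0.07))  # assumed initialization

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (num_tokens, hidden_dim)
        tokens = F.normalize(self.proj(hidden_states), dim=-1)   # unit hypersphere
        experts = F.normalize(self.expert_embed, dim=-1)          # unit hypersphere
        scores = tokens @ experts.t() / self.temperature          # scaled cosine similarity
        return F.softmax(scores, dim=-1)                          # routing probabilities


# Usage: route each token to its top-1 expert.
router = HypersphericalRouter(hidden_dim=768, num_experts=32)
probs = router(torch.randn(10, 768))
top1_expert = probs.argmax(dim=-1)  # expert index per token
```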
Author Information
Zewen Chi (Beijing Institute of Technology)
Li Dong (Microsoft Research)
Shaohan Huang (Microsoft)
Damai Dai (Peking University)
Shuming Ma (Microsoft Research)
Barun Patra (Microsoft)
Saksham Singhal (Microsoft)
Payal Bajaj (Microsoft)
Xia Song (Microsoft)
Xian-Ling Mao (Beijing Institute of Technology)
Heyan Huang (Beijing Institute of Technology)
Furu Wei (Microsoft Research Asia)
More from the Same Authors
- 2022 Poster: VLMo: Unified Vision-Language Pre-Training with Mixture-of-Modality-Experts »
  Hangbo Bao · Wenhui Wang · Li Dong · Qiang Liu · Owais Khan Mohammed · Kriti Aggarwal · Subhojit Som · Songhao Piao · Furu Wei
- 2021 Poster: COCO-LM: Correcting and Contrasting Text Sequences for Language Model Pretraining »
  Yu Meng · Chenyan Xiong · Payal Bajaj · Saurabh Tiwary · Paul Bennett · Jiawei Han · Xia Song
- 2020 Poster: BERT Loses Patience: Fast and Robust Inference with Early Exit »
  Wangchunshu Zhou · Canwen Xu · Tao Ge · Julian McAuley · Ke Xu · Furu Wei
- 2020 Poster: Pushing the Limits of Narrow Precision Inferencing at Cloud Scale with Microsoft Floating Point »
  Bita Darvish Rouhani · Daniel Lo · Ritchie Zhao · Ming Liu · Jeremy Fowers · Kalin Ovtcharov · Anna Vinogradsky · Sarah Massengill · Lita Yang · Ray Bittner · Alessandro Forin · Haishan Zhu · Taesik Na · Prerak Patel · Shuai Che · Lok Chand Koppaka · Xia Song · Subhojit Som · Kaustav Das · Saurabh K T · Steve Reinhardt · Sitaram Lanka · Eric Chung · Doug Burger
- 2020 Poster: MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers »
  Wenhui Wang · Furu Wei · Li Dong · Hangbo Bao · Nan Yang · Ming Zhou
- 2019 Poster: Unified Language Model Pre-training for Natural Language Understanding and Generation »
  Li Dong · Nan Yang · Wenhui Wang · Furu Wei · Xiaodong Liu · Yu Wang · Jianfeng Gao · Ming Zhou · Hsiao-Wuen Hon