The Transformer is an extremely powerful and prominent deep learning architecture. In this work, we challenge the commonly held belief in deep learning that going deeper is better, and present an alternative design approach: building wider attention Transformers. We demonstrate that wide single-layer Transformer models can compete with or outperform deeper ones on a variety of Natural Language Processing (NLP) tasks when both are trained from scratch. We then systematically study the impact of changing the model aspect ratio on Transformers. This ratio balances the number of layers against the number of attention heads per layer, while keeping the total number of attention heads and all other hyperparameters constant. On average, across 4 NLP tasks and 10 attention types, single-layer wide models perform 0.3% better than their deep counterparts. Through an in-depth evaluation we demonstrate that wide models require a far smaller memory footprint and can run faster on commodity hardware; in addition, these wider models are also more interpretable. For example, a single-layer Transformer for IMDb byte-level text classification has 3.1x faster inference latency on a CPU than its equally accurate deeper counterpart, and is half the size. Our results suggest that the critical direction for building better Transformers for NLP is their width, and that their depth is less relevant.
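The sketch below is a minimal illustration, not the authors' code, of the aspect-ratio idea described in the abstract: trading layers against attention heads per layer while the total head budget stays fixed. The model dimension, head budget, and layer counts here are assumed values chosen only so the example runs.

```python
# Minimal sketch (hypothetical configuration) of wide vs. deep Transformers
# with the same total number of attention heads, using standard PyTorch modules.
import torch
import torch.nn as nn

D_MODEL = 768       # assumed embedding size (must be divisible by heads per layer)
TOTAL_HEADS = 96    # assumed fixed budget of attention heads across the whole model


def make_encoder(num_layers: int) -> nn.TransformerEncoder:
    """Build an encoder whose layers split TOTAL_HEADS evenly."""
    heads_per_layer = TOTAL_HEADS // num_layers
    layer = nn.TransformerEncoderLayer(
        d_model=D_MODEL, nhead=heads_per_layer, batch_first=True)
    return nn.TransformerEncoder(layer, num_layers=num_layers)


deep_model = make_encoder(num_layers=12)  # deep: 12 layers x 8 heads
wide_model = make_encoder(num_layers=1)   # wide: 1 layer x 96 heads

x = torch.randn(2, 128, D_MODEL)          # (batch, sequence length, features)
print(deep_model(x).shape, wide_model(x).shape)
```

Both encoders accept the same inputs and carry the same total head count; only the depth-versus-width split differs, which is the design axis the paper evaluates.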
Author Information
Jason Brown (University of Cambridge)
Yiren Zhao (University of Cambridge)
I Shumailov (University of Toronto)
Robert Mullins (University of Cambridge)
Related Events (a corresponding poster, oral, or spotlight)
- 2022: Wide Attention Is The Way Forward For Transformers
  Fri. Dec 2nd, 09:20 -- 09:30 PM
More from the Same Authors
- 2021: DAdaQuant: Doubly-adaptive quantization for communication-efficient Federated Learning
  Robert Hönig · Yiren Zhao · Robert Mullins
- 2022: Dynamic Head Pruning in Transformers
  Prisha Satwani · Yiren Zhao · Vidhi Lalchand · Robert Mullins
- 2022: Revisiting Graph Neural Network Embeddings
  Skye Purchase · Yiren Zhao · Robert Mullins
- 2022: DARTFormer: Finding The Best Type Of Attention
  Jason Brown · Yiren Zhao · I Shumailov · Robert Mullins
- 2022 Poster: Rapid Model Architecture Adaption for Meta-Learning
  Yiren Zhao · Xitong Gao · I Shumailov · Nicolo Fusi · Robert Mullins
- 2022 Poster: In Differential Privacy, There is Truth: on Vote-Histogram Leakage in Ensemble Private Learning
  Jiaqi Wang · Roei Schuster · I Shumailov · David Lie · Nicolas Papernot
- 2022 Poster: On the Limitations of Stochastic Pre-processing Defenses
  Yue Gao · I Shumailov · Kassem Fawaz · Nicolas Papernot
- 2021 Poster: Manipulating SGD with Data Ordering Attacks
  I Shumailov · Zakhar Shumaylov · Dmitry Kazhdan · Yiren Zhao · Nicolas Papernot · Murat Erdogdu · Ross J Anderson
- 2019 Poster: Focused Quantization for Sparse CNNs
  Yiren Zhao · Xitong Gao · Daniel Bates · Robert Mullins · Cheng-Zhong Xu