Transformers with multi-head self-attention have achieved remarkable success in sequence modeling and beyond. However, they suffer from high computational and memory complexity when computing the attention matrix at each head. Recently, it has been shown that these attention matrices lie on a low-dimensional manifold and are therefore redundant. We propose Transformers with a Finite Admixture of Shared Heads (FiSHformers), a novel class of efficient and flexible transformers that allow attention matrices to be shared across attention heads. At the core of FiSHformers is a novel finite admixture model of shared heads (FiSH) that samples attention matrices from a set of global attention matrices, where the number of global attention matrices is much smaller than the number of local attention matrices the model would otherwise generate. FiSHformers learn these global attention matrices directly rather than the local ones as in other transformers, significantly improving the computational and memory efficiency of the model. We empirically verify the advantages of FiSHformers over baseline transformers in a wide range of practical applications including language modeling, machine translation, and image classification. On WikiText-103, IWSLT'14 De-En, and WMT'14 En-De, FiSHformers use far fewer floating-point operations (FLOPs), less memory, and fewer parameters than the baseline transformers.
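To make the head-sharing idea concrete, here is a minimal sketch, not the authors' implementation: it computes only M global attention matrices and forms each of the H local heads as a learned convex combination of them. The module name SharedHeadAttention, the parameter n_global, and the softmax-normalized mix_logits are illustrative assumptions; the paper's admixture model may parameterize the sampling/sharing differently.

```python
# Minimal sketch (assumed parameterization, not the FiSHformer code):
# H per-head attention matrices are mixtures of M << H shared "global"
# attention matrices, so only M attention maps are ever computed.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedHeadAttention(nn.Module):
    def __init__(self, d_model, n_heads, n_global):
        super().__init__()
        assert n_global <= n_heads
        self.d_head = d_model // n_heads
        self.n_heads, self.n_global = n_heads, n_global
        # Queries/keys only for the M global heads; values for all H heads.
        self.q_proj = nn.Linear(d_model, n_global * self.d_head)
        self.k_proj = nn.Linear(d_model, n_global * self.d_head)
        self.v_proj = nn.Linear(d_model, n_heads * self.d_head)
        self.out_proj = nn.Linear(n_heads * self.d_head, d_model)
        # Admixture weights: one distribution over the M global attention
        # matrices per local head (hypothetical parameterization).
        self.mix_logits = nn.Parameter(torch.zeros(n_heads, n_global))

    def forward(self, x):  # x: (batch, seq, d_model)
        B, T, _ = x.shape
        q = self.q_proj(x).view(B, T, self.n_global, self.d_head).transpose(1, 2)
        k = self.k_proj(x).view(B, T, self.n_global, self.d_head).transpose(1, 2)
        v = self.v_proj(x).view(B, T, self.n_heads, self.d_head).transpose(1, 2)
        # M global attention matrices: (B, M, T, T).
        attn = F.softmax(q @ k.transpose(-2, -1) / self.d_head ** 0.5, dim=-1)
        # Each of the H local heads is a convex combination of the M maps.
        mix = F.softmax(self.mix_logits, dim=-1)            # (H, M)
        local = torch.einsum('hm,bmts->bhts', mix, attn)    # (B, H, T, T)
        out = (local @ v).transpose(1, 2).reshape(B, T, -1)
        return self.out_proj(out)
```

Under this sketch, the attention-map cost scales with M rather than H, which is where the FLOP, memory, and parameter savings reported in the abstract would come from.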
Author Information
Tan Nguyen (University of California, Los Angeles)
Tam Nguyen (FPT Software)
Hai Do (Vietnam National University Hanoi)
Khai Nguyen (University of Texas at Austin)
Vishwanath Saragadam (Rice University)
Minh Pham (University of California, Los Angeles)
Khuong Duy Nguyen (JAIST)
Nhat Ho (University of Texas at Austin)
Stanley Osher (University of California, Los Angeles)
More from the Same Authors
- 2022: Statistical and Computational Complexities of BFGS Quasi-Newton Method for Generalized Linear Models
  Qiujiang Jin · Aryan Mokhtari · Nhat Ho · Tongzheng Ren
- 2022 Poster: Amortized Projection Optimization for Sliced Wasserstein Generative Models
  Khai Nguyen · Nhat Ho
- 2022 Poster: Revisiting Sliced Wasserstein on Images: From Vectorization to Convolution
  Khai Nguyen · Nhat Ho
- 2022 Poster: Stochastic Multiple Target Sampling Gradient Descent
  Hoang Phan · Ngoc Tran · Trung Le · Toan Tran · Nhat Ho · Dinh Phung
- 2022 Poster: Beyond black box densities: Parameter learning for the deviated components
  Dat Do · Nhat Ho · XuanLong Nguyen
- 2022 Poster: Improving Neural Ordinary Differential Equations with Nesterov's Accelerated Gradient Method
  Ho Huu Nghia Nguyen · Tan Nguyen · Huyen Vo · Stanley Osher · Thieu Vo
- 2022 Poster: FourierFormer: Transformer Meets Generalized Fourier Integral Theorem
  Tan Nguyen · Minh Pham · Tam Nguyen · Khai Nguyen · Stanley Osher · Nhat Ho
- 2021: Stan Osher Talk
  Stanley Osher
- 2021 Poster: Structured Dropout Variational Inference for Bayesian Neural Networks
  Son Nguyen · Duong Nguyen · Khai Nguyen · Khoat Than · Hung Bui · Nhat Ho
- 2021 Poster: On Robust Optimal Transport: Computational Complexity and Barycenter Computation
  Khang Le · Huy Nguyen · Quang M Nguyen · Tung Pham · Hung Bui · Nhat Ho
- 2021 Poster: FMMformer: Efficient and Flexible Transformer via Decomposed Near-field and Far-field Attention
  Tan Nguyen · Vai Suliafu · Stanley Osher · Long Chen · Bao Wang
- 2021 Poster: Heavy Ball Neural Ordinary Differential Equations
  Hedi Xia · Vai Suliafu · Hangjie Ji · Tan Nguyen · Andrea Bertozzi · Stanley Osher · Bao Wang
- 2020 Poster: Projection Robust Wasserstein Distance and Riemannian Optimization
  Tianyi Lin · Chenyou Fan · Nhat Ho · Marco Cuturi · Michael Jordan
- 2020 Poster: Fixed-Support Wasserstein Barycenters: Computational Hardness and Fast Algorithm
  Tianyi Lin · Nhat Ho · Xi Chen · Marco Cuturi · Michael Jordan
- 2020 Spotlight: Projection Robust Wasserstein Distance and Riemannian Optimization
  Tianyi Lin · Chenyou Fan · Nhat Ho · Marco Cuturi · Michael Jordan
- 2020 Poster: MomentumRNN: Integrating Momentum into Recurrent Neural Networks
  Tan Nguyen · Richard Baraniuk · Andrea Bertozzi · Stanley Osher · Bao Wang
- 2019 Poster: ResNets Ensemble via the Feynman-Kac Formalism to Improve Natural and Robust Accuracies
  Bao Wang · Zuoqiang Shi · Stanley Osher
- 2018 Poster: Deep Neural Nets with Interpolating Function as Output Activation
  Bao Wang · Xiyang Luo · Zhen Li · Wei Zhu · Zuoqiang Shi · Stanley Osher