Transformers are expensive to train because the self-attention mechanism has quadratic time and space complexity. Kernel machines suffer from the same computational bottleneck in pairwise dot products, but several approximation schemes have been successfully incorporated to considerably reduce their computational cost without sacrificing too much accuracy. In this work, we leverage these computation methods for kernel machines to alleviate the high computational cost of self-attention and introduce Skyformer, which replaces the softmax structure with a Gaussian kernel to stabilize model training and adapts the Nyström method to a non-positive semidefinite matrix to accelerate the computation. We further conduct a theoretical analysis showing that the matrix approximation error of our proposed method is small in the spectral norm. Experiments on the Long Range Arena benchmark show that the proposed method achieves comparable or even better performance than full self-attention while requiring fewer computational resources.
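The two ideas named in the abstract, Gaussian-kernel scores in place of softmax and a Nyström-style low-rank approximation, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the unit kernel bandwidth, the uniform landmark sampling, and the omission of any normalization are all assumptions made for illustration. The sketch treats the query-key score matrix as an off-diagonal block of the (positive semidefinite) Gaussian kernel matrix over the stacked queries and keys, which is one way to apply Nyström to a block that is not itself positive semidefinite.

```python
import numpy as np


def gaussian_kernel(x, y):
    """Pairwise Gaussian kernel exp(-||x_i - y_j||^2 / 2) (unit bandwidth, an assumption)."""
    sq_dist = (x ** 2).sum(axis=1)[:, None] + (y ** 2).sum(axis=1)[None, :] - 2.0 * x @ y.T
    # Clip tiny negative values caused by floating-point round-off.
    return np.exp(-0.5 * np.maximum(sq_dist, 0.0))


def nystrom_gaussian_attention(q, k, v, num_landmarks, rng):
    """Approximate gaussian_kernel(q, k) @ v without forming the full n-by-n score matrix.

    Landmarks are sampled uniformly from the stacked queries and keys
    (a hypothetical choice for this sketch).
    """
    z = np.concatenate([q, k], axis=0)
    idx = rng.choice(z.shape[0], size=num_landmarks, replace=False)
    landmarks = z[idx]

    left = gaussian_kernel(q, landmarks)          # (n, m)
    core = gaussian_kernel(landmarks, landmarks)  # (m, m), positive semidefinite
    right = gaussian_kernel(landmarks, k)         # (m, n)

    # Multiply right-to-left so every intermediate has at most n * m entries.
    return left @ (np.linalg.pinv(core) @ (right @ v))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, d = 256, 16
    q, k, v = (rng.standard_normal((n, d)) for _ in range(3))
    approx = nystrom_gaussian_attention(q, k, v, num_landmarks=64, rng=rng)
    print(approx.shape)
```

With `m` landmarks the cost is dominated by the `n`-by-`m` kernel blocks and the `m`-by-`m` pseudo-inverse rather than the `n`-by-`n` score matrix, which is where the savings over full attention come from in this sketch.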
Author Information
Yifan Chen (University of Illinois, Urbana Champaign)
Qi Zeng (University of Illinois, Urbana Champaign)
Heng Ji (University of Illinois)
Yun Yang (University of Illinois, Urbana Champaign)
More from the Same Authors
- 2023 Poster: Paxion: Patching Action Knowledge in Video-Language Foundation Models
  Zhenhailong Wang · Ansel Blume · Sha Li · Genglin Liu · Jaemin Cho · Zineng Tang · Mohit Bansal · Heng Ji
- 2023 Poster: Revisiting Out-of-distribution Robustness in NLP: Benchmarks, Analysis, and LLMs Evaluations
  Lifan Yuan · Yangyi Chen · Ganqu Cui · Hongcheng Gao · FangYuan Zou · Xingyi Cheng · Heng Ji · Zhiyuan Liu · Maosong Sun
- 2022 Poster: Language Models with Image Descriptors are Strong Few-Shot Video-Language Learners
  Zhenhailong Wang · Manling Li · Ruochen Xu · Luowei Zhou · Jie Lei · Xudong Lin · Shuohang Wang · Ziyi Yang · Chenguang Zhu · Derek Hoiem · Shih-Fu Chang · Mohit Bansal · Heng Ji
- 2022 Poster: Wasserstein $K$-means for clustering probability distributions
  Yubo Zhuang · Xiaohui Chen · Yun Yang
- 2020: Panel #2
  Oren Etzioni · Heng Ji · Subbarao Kambhampati · Victoria Lin · Jiajun Wu
- 2020: Q&A #2
  Heng Ji · Jure Leskovec · Jiajun Wu
- 2020: Invited Talk #5
  Heng Ji
- 2020 Demonstration: A Knowledge Graph Reasoning Prototype
  Lihui Liu · Boxin Du · Heng Ji · Hanghang Tong