Pre-trained language models like BERT and its variants have recently achieved impressive performance in various natural language understanding tasks. However, BERT relies heavily on the global self-attention block and thus suffers from a large memory footprint and high computation cost. Although all its attention heads query the whole input sequence to generate the attention map from a global perspective, we observe that some heads only need to learn local dependencies, which implies computational redundancy. We therefore propose a novel span-based dynamic convolution to replace these self-attention heads and directly model local dependencies. The novel convolution heads, together with the remaining self-attention heads, form a new mixed attention block that is more efficient at both global and local context learning. We equip BERT with this mixed attention design and build a ConvBERT model. Experiments show that ConvBERT significantly outperforms BERT and its variants in various downstream tasks, with lower training cost and fewer model parameters. Remarkably, the ConvBERTbase model achieves a GLUE score of 86.4, 0.7 higher than ELECTRAbase, while using less than 1/4 of the training cost. Code and pre-trained models will be released.
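To make the idea of a convolution head that models local dependencies more concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes a single head, an odd kernel width, and equal input and head dimensions; it also uses a plain depthwise convolution to summarize the local span and shares one generated kernel across all channels. The names `span_dynamic_conv`, `windows`, `w_q`, `w_v`, `w_span`, and `w_f` are hypothetical and chosen for illustration only.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def windows(x, k):
    # Zero-padded sliding windows of width k (assumed odd): (n, d) -> (n, k, d).
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))
    return np.stack([xp[i:i + k] for i in range(x.shape[0])])

def span_dynamic_conv(x, w_q, w_v, w_span, w_f):
    # x: (n, d) token features; w_q, w_v: (d, d); w_span: (k, d); w_f: (d, k).
    k = w_span.shape[0]
    q = x @ w_q                                            # per-token query
    v = x @ w_v                                            # per-token value
    # Span-aware key: depthwise convolution over each token's local window
    # (a simplifying assumption made for this sketch).
    k_s = np.einsum("nkd,kd->nd", windows(x, k), w_span)
    # One convolution kernel per position, generated from query and local span.
    kernel = softmax((q * k_s) @ w_f, axis=-1)             # (n, k)
    # Apply each position's kernel to the values in its local window.
    return np.einsum("nk,nkd->nd", kernel, windows(v, k))

rng = np.random.default_rng(0)
n, d, k = 8, 4, 3
x = rng.normal(size=(n, d))
w_q = rng.normal(size=(d, d))
w_v = rng.normal(size=(d, d))
w_span = rng.normal(size=(k, d))
w_f = rng.normal(size=(d, k))
out = span_dynamic_conv(x, w_q, w_v, w_span, w_f)
print(out.shape)  # (8, 4)
```

In this sketch each output position depends only on a window of `k` neighbouring tokens, so the cost grows linearly with sequence length, whereas a global self-attention head compares every token with every other token.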
Author Information
Zi-Hang Jiang (National University of Singapore)
Weihao Yu (National University of Singapore)
Daquan Zhou (National University of Singapore)
Yunpeng Chen (Yitu Technology)
Jiashi Feng (National University of Singapore)
Shuicheng Yan (National University of Singapore)
Related Events (a corresponding poster, oral, or spotlight)
- 2020 Spotlight: ConvBERT: Improving BERT with Span-based Dynamic Convolution
  Tue. Dec 8th 03:30 -- 03:40 AM Room Orals & Spotlights: Language/Audio Applications
More from the Same Authors
- 2022 Poster: Inception Transformer
  Chenyang Si · Weihao Yu · Pan Zhou · Yichen Zhou · Xinchao Wang · Shuicheng Yan
- 2022 Spotlight: Scaling & Shifting Your Features: A New Baseline for Efficient Model Tuning
  Dongze Lian · Daquan Zhou · Jiashi Feng · Xinchao Wang
- 2022 Spotlight: Lightning Talks 6A-1
  Ziyi Wang · Nian Liu · Yaming Yang · Qilong Wang · Yuanxin Liu · Zongxin Yang · Yizhao Gao · Yanchen Deng · Dongze Lian · Nanyi Fei · Ziyu Guan · Xiao Wang · Shufeng Kong · Xumin Yu · Daquan Zhou · Yi Yang · Fandong Meng · Mingze Gao · Caihua Liu · Yongming Rao · Zheng Lin · Haoyu Lu · Zhe Wang · Jiashi Feng · Zhaolin Zhang · Deyu Bo · Xinchao Wang · Chuan Shi · Jiangnan Li · Jiangtao Xie · Jie Zhou · Zhiwu Lu · Wei Zhao · Bo An · Jiwen Lu · Peihua Li · Jian Pei · Hao Jiang · Cai Xu · Peng Fu · Qinghua Hu · Yijie Li · Weigang Lu · Yanan Cao · Jianbin Huang · Weiping Wang · Zhao Cao · Jie Zhou
- 2022 Spotlight: Inception Transformer
  Chenyang Si · Weihao Yu · Pan Zhou · Yichen Zhou · Xinchao Wang · Shuicheng Yan
- 2022 Spotlight: Lightning Talks 2B-1
  Yehui Tang · Jian Wang · Zheng Chen · Man Zhou · Peng Gao · Chenyang Si · Shangkun Sun · Yixing Xu · Weihao Yu · Xinghao Chen · Kai Han · Hu Yu · Yulun Zhang · Chenhui Gou · Teli Ma · Yuanqi Chen · Yunhe Wang · Hongsheng Li · Jinjin Gu · Jianyuan Guo · Qiman Wu · Pan Zhou · Yu Zhu · Jie Huang · Chang Xu · Yichen Zhou · Haocheng Feng · Guodong Guo · Yongbing Zhang · Ziyi Lin · Feng Zhao · Ge Li · Junyu Han · Jinwei Gu · Jifeng Dai · Chao Xu · Xinchao Wang · Linghe Kong · Shuicheng Yan · Yu Qiao · Chen Change Loy · Xin Yuan · Errui Ding · Yunhe Wang · Deyu Meng · Jingdong Wang · Chongyi Li
- 2022 Poster: Deep Model Reassembly
  Xingyi Yang · Daquan Zhou · Songhua Liu · Jingwen Ye · Xinchao Wang
- 2022 Poster: Scaling & Shifting Your Features: A New Baseline for Efficient Model Tuning
  Dongze Lian · Daquan Zhou · Jiashi Feng · Xinchao Wang
- 2022 Poster: Sharpness-Aware Training for Free
  Jiawei Du · Daquan Zhou · Jiashi Feng · Vincent Tan · Joey Tianyi Zhou
- 2021 Workshop: Distribution shifts: connecting methods and applications (DistShift)
  Shiori Sagawa · Pang Wei Koh · Fanny Yang · Hongseok Namkoong · Jiashi Feng · Kate Saenko · Percy Liang · Sarah Bird · Sergey Levine
- 2021 Poster: Towards Understanding Why Lookahead Generalizes Better Than SGD and Beyond
  Pan Zhou · Hanshu Yan · Xiaotong Yuan · Jiashi Feng · Shuicheng Yan
- 2021 Poster: How Should Pre-Trained Language Models Be Fine-Tuned Towards Adversarial Robustness?
  Xinshuai Dong · Anh Tuan Luu · Min Lin · Shuicheng Yan · Hanwang Zhang
- 2021 Poster: All Tokens Matter: Token Labeling for Training Better Vision Transformers
  Zi-Hang Jiang · Qibin Hou · Li Yuan · Daquan Zhou · Yujun Shi · Xiaojie Jin · Anran Wang · Jiashi Feng
- 2021 Poster: Direct Multi-view Multi-person 3D Pose Estimation
  Tao Wang · Jianfeng Zhang · Yujun Cai · Shuicheng Yan · Jiashi Feng
- 2020 Poster: Towards Theoretically Understanding Why SGD Generalizes Better Than Adam in Deep Learning
  Pan Zhou · Jiashi Feng · Chao Ma · Caiming Xiong · Steven Chu Hong Hoi · Weinan E
- 2020 Poster: Residual Distillation: Towards Portable Deep Neural Networks without Shortcuts
  Guilin Li · Junlei Zhang · Yunhe Wang · Chuanjian Liu · Matthias Tan · Yunfeng Lin · Wei Zhang · Jiashi Feng · Tong Zhang
- 2020 Poster: Improving Generalization in Reinforcement Learning with Mixture Regularization
  Kaixin Wang · Bingyi Kang · Jie Shao · Jiashi Feng
- 2020 Poster: Inference Stage Optimization for Cross-scenario 3D Human Pose Estimation
  Jianfeng Zhang · Xuecheng Nie · Jiashi Feng
- 2019 Poster: Efficient Meta Learning via Minibatch Proximal Update
  Pan Zhou · Xiaotong Yuan · Huan Xu · Shuicheng Yan · Jiashi Feng
- 2019 Poster: Heterogeneous Graph Learning for Visual Commonsense Reasoning
  Weijiang Yu · Jingwen Zhou · Weihao Yu · Xiaodan Liang · Nong Xiao
- 2019 Spotlight: Heterogeneous Graph Learning for Visual Commonsense Reasoning
  Weijiang Yu · Jingwen Zhou · Weihao Yu · Xiaodan Liang · Nong Xiao
- 2019 Spotlight: Efficient Meta Learning via Minibatch Proximal Update
  Pan Zhou · Xiaotong Yuan · Huan Xu · Shuicheng Yan · Jiashi Feng
- 2018 Poster: New Insight into Hybrid Stochastic Gradient Descent: Beyond With-Replacement Sampling and Convexity
  Pan Zhou · Xiaotong Yuan · Jiashi Feng
- 2018 Poster: Efficient Stochastic Gradient Hard Thresholding
  Pan Zhou · Xiaotong Yuan · Jiashi Feng
- 2018 Poster: A^2-Nets: Double Attention Networks
  Yunpeng Chen · Yannis Kalantidis · Jianshu Li · Shuicheng Yan · Jiashi Feng
- 2017 Poster: Dual Path Networks
  Yunpeng Chen · Jianan Li · Huaxin Xiao · Xiaojie Jin · Shuicheng Yan · Jiashi Feng
- 2017 Spotlight: Dual Path Networks
  Yunpeng Chen · Jianan Li · Huaxin Xiao · Xiaojie Jin · Shuicheng Yan · Jiashi Feng
- 2017 Poster: Multimodal Learning and Reasoning for Visual Question Answering
  Ilija Ilievski · Jiashi Feng
- 2017 Poster: Predicting Scene Parsing and Motion Dynamics in the Future
  Xiaojie Jin · Huaxin Xiao · Xiaohui Shen · Jimei Yang · Zhe Lin · Yunpeng Chen · Zequn Jie · Jiashi Feng · Shuicheng Yan
- 2017 Poster: Dual-Agent GANs for Photorealistic and Identity Preserving Profile Face Synthesis
  Jian Zhao · Lin Xiong · Karlekar Jayashree · Jianshu Li · Fang Zhao · Zhecan Wang · Sugiri Pranata · Shengmei Shen · Shuicheng Yan · Jiashi Feng
- 2016 Poster: Tree-Structured Reinforcement Learning for Sequential Object Localization
  Zequn Jie · Xiaodan Liang · Jiashi Feng · Xiaojie Jin · Wen Lu · Shuicheng Yan
- 2014 Poster: Robust Logistic Regression and Classification
  Jiashi Feng · Huan Xu · Shie Mannor · Shuicheng Yan
- 2014 Poster: Convex Optimization Procedure for Clustering: Theoretical Revisit
  Changbo Zhu · Huan Xu · Chenlei Leng · Shuicheng Yan
- 2014 Poster: On a Theory of Nonparametric Pairwise Similarity for Clustering: Connecting Clustering to Classification
  Yingzhen Yang · Feng Liang · Shuicheng Yan · Zhangyang Wang · Thomas S Huang
- 2013 Poster: Online Robust PCA via Stochastic Optimization
  Jiashi Feng · Huan Xu · Shuicheng Yan
- 2013 Poster: Online PCA for Contaminated Data
  Jiashi Feng · Huan Xu · Shie Mannor · Shuicheng Yan