Poster
Expediting Large-Scale Vision Transformer for Dense Prediction without Fine-tuning
Weicong Liang · YUHUI YUAN · Henghui Ding · Xiao Luo · Weihong Lin · Ding Jia · Zheng Zhang · Chao Zhang · Han Hu
Vision transformers have recently achieved competitive results across various vision tasks but still suffer from heavy computation costs when processing a large number of tokens. Many advanced approaches have been developed to reduce the total number of tokens in large-scale vision transformers, especially for image classification tasks. Typically, they select a small group of essential tokens according to their relevance to the [class] token, then fine-tune the weights of the vision transformer. Such fine-tuning is less practical for dense prediction, where the computation and GPU memory costs are much heavier than for image classification. In this paper, we focus on a more challenging problem, i.e., accelerating large-scale vision transformers for dense prediction without any additional re-training or fine-tuning. Because high-resolution representations are necessary for dense prediction, we present two non-parametric operators: a token clustering layer to decrease the number of tokens and a token reconstruction layer to increase the number of tokens. The method proceeds in three steps: (i) we use the token clustering layer to cluster neighboring tokens together, resulting in low-resolution representations that maintain the spatial structures; (ii) we apply the subsequent transformer layers only to these low-resolution representations, i.e., the clustered tokens; and (iii) we use the token reconstruction layer to re-create the high-resolution representations from the refined low-resolution representations. The results obtained by our method are promising on five dense prediction tasks: object detection, semantic segmentation, panoptic segmentation, instance segmentation, and depth estimation. In particular, our method increases the FPS of "Segmenter+ViT-L/16" by 40% and reduces its GFLOPs by 30% while maintaining 99.5% of the performance on ADE20K without fine-tuning the official weights.
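The three steps in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The abstract does not specify the operators in detail, so the sketch makes these assumptions: clustering is approximated by averaging non-overlapping s x s windows of the token grid; reconstruction weights come from a softmax over dot-product similarities between the original high-resolution tokens and the clustered tokens before refinement; and `refine` is a hypothetical placeholder for the frozen transformer layers. All function names and the temperature `tau` are illustrative.

```python
import numpy as np


def cluster_tokens(tokens, h, w, s):
    """Assumed stand-in for a token clustering layer.

    Averages each non-overlapping s x s window of an h x w token grid,
    so neighboring tokens are merged and the spatial layout is kept.
    tokens: (h * w, c) array in row-major grid order.
    Returns: ((h // s) * (w // s), c) array.
    """
    c = tokens.shape[1]
    grid = tokens.reshape(h // s, s, w // s, s, c)
    return grid.mean(axis=(1, 3)).reshape(-1, c)


def reconstruct_tokens(hi_before, lo_before, lo_after, tau=1.0):
    """Assumed stand-in for a token reconstruction layer.

    The relation between high- and low-resolution tokens is measured
    before refinement, then applied to the refined low-resolution tokens.
    No learned parameters are involved.
    """
    logits = hi_before @ lo_before.T / tau
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=1, keepdims=True)  # rows sum to 1
    return weights @ lo_after


def refine(x):
    """Hypothetical placeholder for the frozen transformer layers."""
    return np.tanh(x)


rng = np.random.default_rng(0)
h, w, c, s = 8, 8, 16, 2
hi = rng.standard_normal((h * w, c))            # 64 high-resolution tokens

lo = cluster_tokens(hi, h, w, s)                # step (i): 16 clustered tokens
lo_refined = refine(lo)                         # step (ii): layers see 16 tokens only
hi_rebuilt = reconstruct_tokens(hi, lo, lo_refined)  # step (iii): back to 64 tokens

print(lo.shape, hi_rebuilt.shape)
```

In this sketch the placeholder layers process 16 tokens instead of 64, which is where the savings would come from; neither operator has trainable weights, consistent with the abstract's description of them as non-parametric.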
Author Information
Weicong Liang (Key Laboratory of Machine Perception (MOE), School of Intelligence Science and Technology, Peking University)
YUHUI YUAN (Microsoft Research Asia)
Henghui Ding (Swiss Federal Institute of Technology)
Xiao Luo (Peking University)
Weihong Lin (Microsoft)
Ding Jia (Peking University)
Zheng Zhang (MSRA)
Chao Zhang (Peking University)
Han Hu (Microsoft Research Asia)
More from the Same Authors
- 2021 Spotlight: Aligning Pretraining for Detection via Object-Level Contrastive Learning
  Fangyun Wei · Yue Gao · Zhirong Wu · Han Hu · Stephen Lin
- 2021 Spotlight: Semi-Supervised Semantic Segmentation via Adaptive Equalization Learning
  Hanzhe Hu · Fangyun Wei · Han Hu · Qiwei Ye · Jinshi Cui · Liwei Wang
- 2021 Spotlight: Bootstrap Your Object Detector via Mixed Training
  Mengde Xu · Zheng Zhang · Fangyun Wei · Yutong Lin · Yue Cao · Stephen Lin · Han Hu · Xiang Bai
- 2022 Poster: Could Giant Pre-trained Image Models Extract Universal Representations?
  Yutong Lin · Ze Liu · Zheng Zhang · Han Hu · Nanning Zheng · Stephen Lin · Yue Cao
- 2022 Poster: Learning Efficient Vision Transformers via Fine-Grained Manifold Distillation
  Zhiwei Hao · Jianyuan Guo · Ding Jia · Kai Han · Yehui Tang · Chao Zhang · Han Hu · Yunhe Wang
- 2022 Spotlight: Lightning Talks 6B-3
  Lingfeng Yang · Yao Lai · Zizheng Pan · Zhenyu Wang · Weicong Liang · Chuanyang Zheng · Jian-Wei Zhang · Peng Jin · Jing Liu · Xiuying Wei · Yao Mu · Xiang Li · YUHUI YUAN · Zizheng Pan · Yifan Sun · Yunchen Zhang · Jianfei Cai · Hao Luo · zheyang li · Jinfa Huang · Haoyu He · Yi Yang · Ping Luo · Fenglin Liu · Henghui Ding · Borui Zhao · Xiangguo Zhang · Kai Zhang · Pichao WANG · Bohan Zhuang · Wei Chen · Ruihao Gong · Zhi Yang · Xian Wu · Feng Ding · Jianfei Cai · Xiao Luo · Renjie Song · Weihong Lin · Jian Yang · Wenming Tan · Bohan Zhuang · Shanghang Zhang · Shen Ge · Fan Wang · Qi Zhang · Guoli Song · Jun Xiao · Hao Li · Ding Jia · David Clifton · Ye Ren · Fengwei Yu · Zheng Zhang · Jie Chen · Shiliang Pu · Xianglong Liu · Chao Zhang · Han Hu
- 2022 Spotlight: Expediting Large-Scale Vision Transformer for Dense Prediction without Fine-tuning
  Weicong Liang · YUHUI YUAN · Henghui Ding · Xiao Luo · Weihong Lin · Ding Jia · Zheng Zhang · Chao Zhang · Han Hu
- 2022 Spotlight: Lightning Talks 2A-3
  David Buterez · Chengan He · Xuan Kan · Yutong Lin · Konstantin Schürholt · Yu Yang · Louis Annabi · Wei Dai · Xiaotian Cheng · Alexandre Pitti · Ze Liu · Jon Paul Janet · Jun Saito · Boris Knyazev · Mathias Quoy · Zheng Zhang · James Zachary · Steven J Kiddle · Xavier Giro-i-Nieto · Chang Liu · Hejie Cui · Zilong Zhang · Hakan Bilen · Damian Borth · Dino Oglic · Holly Rushmeier · Han Hu · Xiangyang Ji · Yi Zhou · Nanning Zheng · Ying Guo · Pietro Liò · Stephen Lin · Carl Yang · Yue Cao
- 2022 Spotlight: Could Giant Pre-trained Image Models Extract Universal Representations?
  Yutong Lin · Ze Liu · Zheng Zhang · Han Hu · Nanning Zheng · Stephen Lin · Yue Cao
- 2022 Poster: Degradation-Aware Unfolding Half-Shuffle Transformer for Spectral Compressive Imaging
  Yuanhao Cai · Jing Lin · Haoqian Wang · Xin Yuan · Henghui Ding · Yulun Zhang · Radu Timofte · Luc V Gool
- 2021 Poster: HRFormer: High-Resolution Vision Transformer for Dense Predict
  YUHUI YUAN · Rao Fu · Lang Huang · Weihong Lin · Chao Zhang · Xilin Chen · Jingdong Wang
- 2021 Poster: Aligning Pretraining for Detection via Object-Level Contrastive Learning
  Fangyun Wei · Yue Gao · Zhirong Wu · Han Hu · Stephen Lin
- 2021 Poster: Semi-Supervised Semantic Segmentation via Adaptive Equalization Learning
  Hanzhe Hu · Fangyun Wei · Han Hu · Qiwei Ye · Jinshi Cui · Liwei Wang
- 2021 Poster: Bootstrap Your Object Detector via Mixed Training
  Mengde Xu · Zheng Zhang · Fangyun Wei · Yutong Lin · Yue Cao · Stephen Lin · Han Hu · Xiang Bai
- 2020 Poster: Self-Adaptive Training: beyond Empirical Risk Minimization
  Lang Huang · Chao Zhang · Hongyang Zhang
- 2020 Poster: RelationNet++: Bridging Visual Representations for Object Detection via Transformer Decoder
  Cheng Chi · Fangyun Wei · Han Hu
- 2020 Spotlight: RelationNet++: Bridging Visual Representations for Object Detection via Transformer Decoder
  Cheng Chi · Fangyun Wei · Han Hu
- 2020 Poster: RepPoints v2: Verification Meets Regression for Object Detection
  Yihong Chen · Zheng Zhang · Yue Cao · Liwei Wang · Stephen Lin · Han Hu
- 2020 Poster: Parametric Instance Classification for Unsupervised Visual Feature learning
  Yue Cao · Zhenda Xie · Bin Liu · Yutong Lin · Zheng Zhang · Han Hu
- 2018 Poster: Joint Sub-bands Learning with Clique Structures for Wavelet Domain Super-Resolution
  Zhisheng Zhong · Tiancheng Shen · Yibo Yang · Zhouchen Lin · Chao Zhang
- 2018 Poster: Greedy Hash: Towards Fast Optimization for Accurate Hash Coding in CNN
  Shupeng Su · Chao Zhang · Kai Han · Yonghong Tian