MCMAE: Masked Convolution Meets Masked Autoencoders
Peng Gao · Teli Ma · Hongsheng Li · Ziyi Lin · Jifeng Dai · Yu Qiao

Wed Nov 30 09:00 AM -- 11:00 AM (PST) @ Hall J #628

Vision Transformers (ViT) have become widely adopted architectures for various vision tasks. Masked auto-encoding for feature pretraining and multi-scale hybrid convolution-transformer architectures can further unlock the potential of ViT, leading to state-of-the-art performance on image classification, detection, and semantic segmentation. In this paper, our MCMAE framework demonstrates that a multi-scale hybrid convolution-transformer can learn more discriminative representations via the masked auto-encoding scheme. However, directly using the original masking strategy leads to heavy computational cost and a pretraining-finetuning discrepancy. To tackle these issues, we adopt masked convolution to prevent information leakage in the convolution blocks. A simple block-wise masking strategy is proposed to ensure computational efficiency. We also propose to supervise the multi-scale features of the encoder more directly to strengthen them. Based on our pretrained MCMAE models, MCMAE-Base improves ImageNet-1K finetuning accuracy by 1.4% compared with MAE-Base. On object detection, MCMAE-Base finetuned for only 25 epochs surpasses MAE-Base finetuned for 100 epochs by 2.9% box AP and 2.2% mask AP, respectively. Code and pretrained models are available at https://github.com/Alpha-VL/ConvMAE.
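
The two ideas named in the abstract, block-wise masking and masked convolution, can be illustrated with a minimal NumPy sketch. This is not the released implementation. The 14x14 token grid, the 75% mask ratio, the upsampling factor of 4, and the function names `blockwise_mask`, `upsample_mask`, and `masked_depthwise_conv` are all assumptions chosen for illustration. The sketch draws a random mask on the coarse token grid, repeats it to a finer feature-map resolution so that every masked token covers a whole block, and zeroes masked positions before a depthwise convolution so that visible positions never read masked content.

```python
import numpy as np

def blockwise_mask(grid=14, mask_ratio=0.75, seed=0):
    """Random mask on the coarse token grid: 1 = visible, 0 = masked.
    Grid size and mask ratio are illustrative assumptions."""
    rng = np.random.default_rng(seed)
    n = grid * grid
    n_keep = int(round(n * (1 - mask_ratio)))
    keep = np.zeros(n, dtype=np.float32)
    keep[rng.permutation(n)[:n_keep]] = 1.0
    return keep.reshape(grid, grid)

def upsample_mask(mask, factor):
    """Repeat each coarse cell into a factor x factor block, so the
    finer-resolution stage sees the same regions masked."""
    return np.kron(mask, np.ones((factor, factor), dtype=mask.dtype))

def masked_depthwise_conv(x, mask, kernel):
    """Depthwise convolution with zero padding that ignores masked inputs.
    x: (C, H, W) features, mask: (H, W), kernel: (C, k, k) with odd k."""
    _, h, w = x.shape
    k = kernel.shape[-1]
    p = k // 2
    # Zero out masked positions so they cannot leak into visible outputs.
    xp = np.pad(x * mask, ((0, 0), (p, p), (p, p)))
    out = np.zeros_like(x)
    for i in range(k):
        for j in range(k):
            out += kernel[:, i, j][:, None, None] * xp[:, i:i + h, j:j + w]
    # Keep outputs only at visible positions.
    return out * mask

coarse = blockwise_mask()          # (14, 14) mask on the token grid
fine = upsample_mask(coarse, 4)    # (56, 56) mask for a finer stage
rng = np.random.default_rng(1)
features = rng.standard_normal((3, 56, 56))
kernel = rng.standard_normal((3, 5, 5))
output = masked_depthwise_conv(features, fine, kernel)
```

Because the input is multiplied by the mask before the convolution, changing feature values at masked positions leaves `output` unchanged, which is the leakage-prevention property the abstract describes.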

Author Information

Peng Gao (Shanghai Artificial Intelligence Laboratory)
Teli Ma (Shanghai Artificial Intelligence Laboratory)
Hongsheng Li (The Chinese University of Hong Kong)
Ziyi Lin (The Chinese University of Hong Kong)
Jifeng Dai (Tsinghua University)
Yu Qiao (Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences)
