Vision transformer (ViT) models exhibit substandard optimizability. In particular, they are sensitive to the choice of optimizer (AdamW vs. SGD), optimizer hyperparameters, and training schedule length. In comparison, modern convolutional neural networks are easier to optimize. Why is this the case? In this work, we conjecture that the issue lies with the patchify stem of ViT models, which is implemented by a stride-p p×p convolution (p = 16 by default) applied to the input image. This large-kernel plus large-stride convolution runs counter to typical design choices of convolutional layers in neural networks. To test whether this atypical design choice causes an issue, we analyze the optimization behavior of ViT models with their original patchify stem versus a simple counterpart where we replace the ViT stem by a small number of stacked stride-two 3×3 convolutions. While the vast majority of computation in the two ViT designs is identical, we find that this small change in early visual processing results in markedly different training behavior in terms of the sensitivity to optimization settings as well as the final model accuracy. Using a convolutional stem in ViT dramatically increases optimization stability and also improves peak performance (by ∼1-2% top-1 accuracy on ImageNet-1k), while maintaining flops and runtime. The improvement can be observed across the wide spectrum of model complexities (from 1G to 36G flops) and dataset scales (from ImageNet-1k to ImageNet-21k). These findings lead us to recommend using a standard, lightweight convolutional stem for ViT models in this regime as a more robust architectural choice compared to the original ViT model design.
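As a rough illustration of the architectural change described above, the following is a minimal PyTorch sketch (our own, not the authors' released code) contrasting the original patchify stem, a single stride-16 16×16 convolution, with a convolutional stem built from stacked stride-two 3×3 convolutions followed by a 1×1 projection. The channel widths and embedding dimension are illustrative assumptions; both stems produce the same 14×14 grid of token embeddings from a 224×224 input.

```python
import torch
import torch.nn as nn

def patchify_stem(embed_dim=384, patch_size=16):
    """Original ViT stem: one large-kernel, large-stride convolution."""
    return nn.Conv2d(3, embed_dim, kernel_size=patch_size, stride=patch_size)

def convolutional_stem(embed_dim=384):
    """Convolutional stem: stacked stride-2 3x3 convs, then a 1x1 projection.
    Channel widths here are illustrative assumptions, not the paper's exact values."""
    channels = [3, 48, 96, 192, 384]
    layers = []
    for c_in, c_out in zip(channels[:-1], channels[1:]):
        layers += [
            nn.Conv2d(c_in, c_out, kernel_size=3, stride=2, padding=1, bias=False),
            nn.BatchNorm2d(c_out),
            nn.ReLU(inplace=True),
        ]
    layers.append(nn.Conv2d(channels[-1], embed_dim, kernel_size=1))  # project to embed_dim
    return nn.Sequential(*layers)

x = torch.randn(1, 3, 224, 224)
for stem in (patchify_stem(), convolutional_stem()):
    tokens = stem(x).flatten(2).transpose(1, 2)  # (B, 196, embed_dim) token sequence
    print(tokens.shape)  # torch.Size([1, 196, 384]) for both stems
```

Four stride-two convolutions downsample by 16× overall, matching the stride-16 patchify stem, so the rest of the transformer is unchanged; the abstract notes that flops and runtime are kept comparable.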
Author Information
Tete Xiao (University of California Berkeley)
Mannat Singh (Facebook AI Research)
Eric Mintun (Facebook AI Research)
Trevor Darrell (University of California Berkeley)
Piotr Dollar (Facebook AI Research)
Ross Girshick (Facebook AI Research)