Vision Transformers (ViTs) have recently achieved performance comparable or superior to convolutional neural networks (CNNs) in computer vision. This empirical breakthrough is even more remarkable since ViTs discard spatial information by mixing patch embeddings with positional encodings and do not embed any visual inductive bias (e.g., spatial locality). Yet, recent work showed that, while minimizing their training loss, ViTs specifically learn spatially localized patterns. This raises a central question: how do ViTs learn these patterns by solely minimizing their training loss using gradient-based methods from \emph{random initialization}? We propose a structured classification dataset and a simplified ViT model to provide a preliminary theoretical justification of this phenomenon. Our model relies on a simplified attention mechanism, the positional attention mechanism, in which the attention matrix depends solely on the positional encodings. While the problem admits multiple solutions that generalize, we show that our model implicitly learns the spatial structure of the dataset while generalizing. We finally prove that learning this structure helps to sample-efficiently transfer to downstream datasets that share the same structure as the pre-training one but differ in their features. We empirically verify that ViTs using only the positional attention mechanism perform similarly to the original ones on CIFAR-10/100, SVHN and ImageNet.
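For readers curious how the positional attention mechanism described above might look in code, here is a minimal PyTorch sketch based only on the abstract's description: the attention matrix is computed from the positional encodings alone, so patch content never influences the mixing weights. The class name, parameter names, and dimensions are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of a positional-attention layer (names and sizes are assumptions).
import torch
import torch.nn as nn


class PositionalAttention(nn.Module):
    def __init__(self, num_patches: int, dim: int, pos_dim: int = 64):
        super().__init__()
        # One learnable positional encoding per patch position.
        self.pos = nn.Parameter(torch.randn(num_patches, pos_dim))
        self.q = nn.Linear(pos_dim, pos_dim, bias=False)
        self.k = nn.Linear(pos_dim, pos_dim, bias=False)
        self.v = nn.Linear(dim, dim, bias=False)  # values still carry patch content

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_patches, dim) patch embeddings.
        # Attention scores depend only on positional encodings (content-free).
        scores = self.q(self.pos) @ self.k(self.pos).T / (self.pos.shape[-1] ** 0.5)
        attn = scores.softmax(dim=-1)  # (num_patches, num_patches)
        return attn @ self.v(x)        # mix patches with position-only weights


# Example: a 32x32 image split into 4x4 patches gives 8x8 = 64 patches.
layer = PositionalAttention(num_patches=64, dim=128)
out = layer(torch.randn(8, 64, 128))   # -> (8, 64, 128)
```

Because the scores ignore patch content, the learned mixing is a fixed function of positions shared across all inputs, which is consistent with the abstract's claim that such a model can still implicitly learn the spatial structure of the dataset.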
Author Information
Samy Jelassi (Princeton University)
Michael Sander (ENS Ulm, CNRS, Paris)
Yuanzhi Li (CMU)
More from the Same Authors
- 2021 : Understanding the Generalization of Adam in Learning Neural Networks with Proper Regularization »
  Difan Zou · Yuan Cao · Yuanzhi Li · Quanquan Gu
- 2022 : Toward Understanding Why Adam Converges Faster Than SGD for Transformers »
  Yan Pan · Yuanzhi Li
- 2022 : Sampling is as easy as learning the score: theory for diffusion models with minimal data assumptions »
  Sitan Chen · Sinho Chewi · Jerry Li · Yuanzhi Li · Adil Salim · Anru Zhang
- 2023 Poster: Read and Reap the Rewards: Learning to Play Atari with the Help of Instruction Manuals »
  Yue Wu · Yewen Fan · Paul Pu Liang · Amos Azaria · Yuanzhi Li · Tom Mitchell
- 2023 Poster: SPRING: Studying Papers and Reasoning to play Games »
  Yue Wu · So Yeon Min · Shrimai Prabhumoye · Yonatan Bisk · Russ Salakhutdinov · Amos Azaria · Tom Mitchell · Yuanzhi Li
- 2023 Poster: How Does Adaptive Optimization Impact Local Neural Network Geometry? »
  Kaiqi Jiang · Dhruv Malik · Yuanzhi Li
- 2023 Poster: The probability flow ODE is provably fast »
  Sitan Chen · Sinho Chewi · Holden Lee · Yuanzhi Li · Jianfeng Lu · Adil Salim
- 2022 Poster: Towards Understanding the Mixture-of-Experts Layer in Deep Learning »
  Zixiang Chen · Yihe Deng · Yue Wu · Quanquan Gu · Yuanzhi Li
- 2022 Poster: The Mechanism of Prediction Head in Non-contrastive Self-supervised Learning »
  Zixin Wen · Yuanzhi Li
- 2022 Poster: Do Residual Neural Networks discretize Neural Ordinary Differential Equations? »
  Michael Sander · Pierre Ablin · Gabriel Peyré
- 2022 Poster: Learning (Very) Simple Generative Models Is Hard »
  Sitan Chen · Jerry Li · Yuanzhi Li
- 2021 Poster: Local Signal Adaptivity: Provable Feature Learning in Neural Networks Beyond Kernels »
  Stefani Karp · Ezra Winston · Yuanzhi Li · Aarti Singh
- 2021 Poster: When Is Generalizable Reinforcement Learning Tractable? »
  Dhruv Malik · Yuanzhi Li · Pradeep Ravikumar
- 2020 Poster: A mean-field analysis of two-player zero-sum games »
  Carles Domingo-Enrich · Samy Jelassi · Arthur Mensch · Grant Rotskoff · Joan Bruna
- 2019 Poster: Towards closing the gap between the theory and practice of SVRG »
  Othmane Sebbouh · Nidham Gazagnadou · Samy Jelassi · Francis Bach · Robert Gower
- 2018 Poster: Smoothed analysis of the low-rank approach for smooth semidefinite programs »
  Thomas Pumir · Samy Jelassi · Nicolas Boumal
- 2018 Oral: Smoothed analysis of the low-rank approach for smooth semidefinite programs »
  Thomas Pumir · Samy Jelassi · Nicolas Boumal