Transformers have achieved remarkable success in several domains, ranging from natural language processing to computer vision. Nevertheless, it has recently been shown that stacking self-attention layers (the distinctive architectural component of Transformers) can result in rank collapse of the tokens' representations at initialization. Whether and how rank collapse affects training is still largely an open question, and its investigation is necessary for a more comprehensive understanding of this architecture. In this work, we shed new light on the causes and the effects of this phenomenon. First, we show that rank collapse of the tokens' representations hinders training by causing the gradients of the queries and keys to vanish at initialization. Furthermore, we provide a thorough description of the origin of rank collapse and discuss how to prevent it via an appropriate depth-dependent scaling of the residual branches. Finally, our analysis reveals that specific architectural hyperparameters affect the gradients of queries, keys, and values differently, leading to disproportionate gradient norms. This suggests an explanation for the widespread use of adaptive methods in Transformer optimization.
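To make the phenomenon concrete, below is a minimal, self-contained sketch. It stacks randomly initialized softmax self-attention layers and measures how close the token representations come to a rank-1 matrix, with and without residual branches scaled by depth. The toy width and depth, and the 1/sqrt(L) residual scaling (one common depth-dependent choice in the literature), are illustrative assumptions, not the paper's exact experimental setup.

```python
# Sketch: rank collapse of token representations under stacked
# self-attention at random initialization, and a depth-dependent
# residual scaling that mitigates it. All dimensions and the
# 1/sqrt(L) scale are assumptions for illustration.
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d = 32, 64  # sequence length and embedding width (assumed)

def self_attention(X):
    """One softmax self-attention layer with fresh random weights."""
    Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
    scores = (X @ Wq) @ (X @ Wk).T / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    A = np.exp(scores)
    A /= A.sum(axis=-1, keepdims=True)             # row-stochastic attention
    return A @ (X @ Wv)

def rank1_residual(X):
    """Relative distance of X from its best rank-1 approximation (in [0, 1])."""
    s = np.linalg.svd(X, compute_uv=False)
    return np.sqrt((s[1:] ** 2).sum() / (s ** 2).sum())

def run(depth, residual_scale=0.0):
    """Stack `depth` attention layers; residual_scale=0 means no skip connection."""
    X = rng.standard_normal((n_tokens, d))
    for _ in range(depth):
        out = self_attention(X)
        X = X + residual_scale * out if residual_scale > 0 else out
    return rank1_residual(X)

L = 24
print(f"pure attention, depth {L}:           {run(L):.3e}")
print(f"residual + 1/sqrt(L) branch scaling: {run(L, 1 / np.sqrt(L)):.3e}")
```

The rank-1 residual is scale-invariant, so the trend does not depend on how the representation norm drifts across layers. In this toy setting one should observe the pure attention stack driving the residual toward zero (collapse), while the identity path with a depth-dependent branch scale keeps it bounded away from zero.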
Author Information
Lorenzo Noci (ETH Zürich)
Sotiris Anagnostidis (ETH Zürich)
Luca Biggio (ETH Zürich)
Antonio Orvieto (ETH Zürich)
PhD student at ETH Zürich. I'm interested in the design and analysis of optimization algorithms for deep learning. Interned at DeepMind, MILA, and Meta. All publications at http://orvi.altervista.org/. Looking for postdoc positions! :) Contact: antonio.orvieto@inf.ethz.ch
Sidak Pal Singh (ETH Zürich)
Aurelien Lucchi (Swiss Federal Institute of Technology)
More from the Same Authors
- 2020: Poster #12
  Luca Biggio
- 2020: Uncertainty-aware Remaining Useful Life predictors
  Luca Biggio · Manuel Arias Chao · Olga Fink
- 2021: Differentiable Strong Lensing for Complex Lens Modelling
  Luca Biggio
- 2022: Batch size selection by stochastic optimal control
  Jim Zhao · Aurelien Lucchi · Frank Proske · Antonio Orvieto · Hans Kersting
- 2022: Fast kinematics modeling for conjunction with lens image modeling
  Matthew Gomer · Luca Biggio · Sebastian Ertl · Han Wang · Aymeric Galan · Lyne Van de Vyvere · Dominique Sluse · Georgios Vernardos · Sherry Suyu
- 2022: Cosmology from Galaxy Redshift Surveys with PointNet
  Sotiris Anagnostidis · Arne Thomsen · Alexandre Refregier · Tomasz Kacprzak · Luca Biggio · Thomas Hofmann · Tilman Tröster
- 2022: Privileged Deep Symbolic Regression
  Luca Biggio · Tommaso Bendinelli · Pierre-Alexandre Kamienny
- 2022: Achieving a Better Stability-Plasticity Trade-off via Auxiliary Networks in Continual Learning
  Sanghwan Kim · Lorenzo Noci · Antonio Orvieto · Thomas Hofmann
- 2022 Poster: On the Theoretical Properties of Noise Correlation in Stochastic Optimization
  Aurelien Lucchi · Frank Proske · Antonio Orvieto · Francis Bach · Hans Kersting
- 2022 Poster: Dynamics of SGD with Stochastic Polyak Stepsizes: Truly Adaptive Variants and Convergence to Exact Solution
  Antonio Orvieto · Simon Lacoste-Julien · Nicolas Loizou
- 2021: Empirics on the expressiveness of Randomized Signature
  Enea Monzio Compagnoni · Luca Biggio · Antonio Orvieto
- 2021 Poster: Analytic Insights into Structure and Rank of Neural Network Hessian Maps
  Sidak Pal Singh · Gregor Bachmann · Thomas Hofmann
- 2021 Poster: Rethinking the Variational Interpretation of Accelerated Optimization Methods
  Peiyuan Zhang · Antonio Orvieto · Hadi Daneshmand
- 2021 Poster: On the Second-order Convergence Properties of Random Search Methods
  Aurelien Lucchi · Antonio Orvieto · Adamos Solomou
- 2020 Poster: Batch normalization provably avoids ranks collapse for randomly initialised deep networks
  Hadi Daneshmand · Jonas Kohler · Francis Bach · Thomas Hofmann · Aurelien Lucchi
- 2020 Poster: Model Fusion via Optimal Transport
  Sidak Pal Singh · Martin Jaggi
- 2020 Poster: WoodFisher: Efficient Second-Order Approximation for Neural Network Compression
  Sidak Pal Singh · Dan Alistarh
- 2020 Poster: Convolutional Generation of Textured 3D Meshes
  Dario Pavllo · Graham Spinks · Thomas Hofmann · Marie-Francine Moens · Aurelien Lucchi
- 2020 Oral: Convolutional Generation of Textured 3D Meshes
  Dario Pavllo · Graham Spinks · Thomas Hofmann · Marie-Francine Moens · Aurelien Lucchi
- 2019: Spotlight talks
  Paul Grigas · Zhewei Yao · Aurelien Lucchi · Si Yi Meng
- 2019 Poster: Shadowing Properties of Optimization Algorithms
  Antonio Orvieto · Aurelien Lucchi
- 2019 Poster: Continuous-time Models for Stochastic Optimization Algorithms
  Antonio Orvieto · Aurelien Lucchi
- 2019 Poster: A Domain Agnostic Measure for Monitoring and Evaluating GANs
  Paulina Grnarova · Kfir Y. Levy · Aurelien Lucchi · Nathanael Perraudin · Ian Goodfellow · Thomas Hofmann · Andreas Krause
- 2017 Poster: Stabilizing Training of Generative Adversarial Networks through Regularization
  Kevin Roth · Aurelien Lucchi · Sebastian Nowozin · Thomas Hofmann
- 2016 Poster: Adaptive Newton Method for Empirical Risk Minimization to Statistical Accuracy
  Aryan Mokhtari · Hadi Daneshmand · Aurelien Lucchi · Thomas Hofmann · Alejandro Ribeiro
- 2015 Poster: Variance Reduced Stochastic Gradient Descent with Neighbors
  Thomas Hofmann · Aurelien Lucchi · Simon Lacoste-Julien · Brian McWilliams