The success of Adam on deep learning architectures such as transformers has made it the default option in many applications where stochastic gradient descent (SGD) does not work well. Our theoretical understanding of this discrepancy is lagging and has prevented us from significantly improving either algorithm. Recent work advanced the hypothesis that Adam, and other heuristics like gradient clipping, outperform SGD on language tasks because the distribution of the stochastic gradients is more heavy-tailed. This suggests that Adam performs better because it is more robust to outliers, which is a promising avenue for designing better stochastic gradient estimators. However, it is unclear whether heavy-tailed noise is the cause of this discrepancy or merely another symptom. Through experiments that control for stochasticity by increasing the batch size, we present evidence that stochasticity and heavy-tailed noise are not the major factors behind this gap.
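The control described in the abstract rests on a simple observation: as the batch size grows, the minibatch gradient concentrates around the full-batch gradient, so any effect driven by (heavy-tailed) gradient noise should shrink. The sketch below is not the authors' code; it only illustrates this averaging effect on simulated heavy-tailed per-example gradients (Student-t with 3 degrees of freedom, a common stand-in for heavy-tailed noise with finite variance).

```python
# Hedged illustration (not the paper's experiment): larger batches average
# away heavy-tailed per-example gradient noise, which is what makes batch
# size a control for stochasticity.
import numpy as np

rng = np.random.default_rng(0)
n_examples = 100_000
true_grad = 1.0
# Simulated per-example "gradients": heavy-tailed noise around the true mean.
per_example = true_grad + rng.standard_t(df=3, size=n_examples)

def noise_scale(batch_size, n_batches=1000):
    """Std. dev. of the minibatch gradient across sampled minibatches."""
    idx = rng.integers(0, n_examples, size=(n_batches, batch_size))
    batch_grads = per_example[idx].mean(axis=1)
    return batch_grads.std()

small = noise_scale(8)      # noisy, heavy-tailed regime
large = noise_scale(2048)   # close to the full-batch gradient
assert large < small
```

In the large-batch limit the noise (and hence its tail behavior) is largely averaged out, so a remaining Adam-vs-SGD gap at large batch sizes would point away from heavy-tailed noise as the cause.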
Author Information
Jacques Chen (University of British Columbia)
Frederik Kunstner (University of British Columbia)
Mark Schmidt (University of British Columbia)
Related Events (a corresponding poster, oral, or spotlight)
- 2021: Heavy-tailed noise does not explain the gap between SGD and Adam on Transformers
More from the Same Authors
- 2021: Faster Quasi-Newton Methods for Linear Composition Problems
  Betty Shea · Mark Schmidt
- 2021: A Closer Look at Gradient Estimators with Reinforcement Learning as Inference
  Jonathan Lavington · Michael Teng · Mark Schmidt · Frank Wood
- 2021: An Empirical Study of Non-Uniform Sampling in Off-Policy Reinforcement Learning for Continuous Control
  Nicholas Ioannidis · Jonathan Lavington · Mark Schmidt
- 2022: Target-based Surrogates for Stochastic Optimization
  Jonathan Lavington · Sharan Vaswani · Reza Babanezhad Harikandeh · Mark Schmidt · Nicolas Le Roux
- 2022: Fast Convergence of Greedy 2-Coordinate Updates for Optimizing with an Equality Constraint
  Amrutha Varshini Ramesh · Aaron Mishkin · Mark Schmidt
- 2022: Fast Convergence of Random Reshuffling under Interpolation and the Polyak-Łojasiewicz Condition
  Chen Fan · Christos Thrampoulidis · Mark Schmidt
- 2022: Practical Structured Riemannian Optimization with Momentum by using Generalized Normal Coordinates
  Wu Lin · Valentin Duruisseaux · Melvin Leok · Frank Nielsen · Mohammad Emtiyaz Khan · Mark Schmidt
- 2021: Contributed talks in Session 4 (Zoom)
  Quanquan Gu · Agnieszka Słowik · Jacques Chen · Neha Wadia · Difan Zou
- 2021: Poster Session 2 (gather.town)
  Wenjie Li · Akhilesh Soni · Jinwuk Seok · Jianhao Ma · Jeffery Kline · Mathieu Tuli · Miaolan Xie · Robert Gower · Quanqi Hu · Matteo Cacciola · Yuanlu Bai · Boyue Li · Wenhao Zhan · Shentong Mo · Junhyung Lyle Kim · Sajad Fathi Hafshejani · Chris Junchi Li · Zhishuai Guo · Harshvardhan Harshvardhan · Neha Wadia · Tatjana Chavdarova · Difan Zou · Zixiang Chen · Aman Gupta · Jacques Chen · Betty Shea · Benoit Dherin · Aleksandr Beznosikov
- 2020: Closing remarks
  Quanquan Gu · Courtney Paquette · Mark Schmidt · Sebastian Stich · Martin Takac
- 2020: Live Q&A with Michael Friedlander (Zoom)
  Mark Schmidt
- 2020: Intro to Invited Speaker 8
  Mark Schmidt
- 2020: Contributed talks in Session 3 (Zoom)
  Mark Schmidt · Zhan Gao · Wenjie Li · Preetum Nakkiran · Denny Wu · Chengrun Yang
- 2020: Live Q&A with Rachel Ward (Zoom)
  Mark Schmidt
- 2020: Live Q&A with Ashia Wilson (Zoom)
  Mark Schmidt
- 2020: Welcome remarks to Session 3
  Mark Schmidt
- 2020 Workshop: OPT2020: Optimization for Machine Learning
  Courtney Paquette · Mark Schmidt · Sebastian Stich · Quanquan Gu · Martin Takac
- 2020: Welcome event (gather.town)
  Quanquan Gu · Courtney Paquette · Mark Schmidt · Sebastian Stich · Martin Takac
- 2020 Poster: Regret Bounds without Lipschitz Continuity: Online Learning with Relative-Lipschitz Losses
  Yihan Zhou · Victor Sanches Portella · Mark Schmidt · Nicholas Harvey
- 2019 Poster: Painless Stochastic Gradient: Interpolation, Line-Search, and Convergence Rates
  Sharan Vaswani · Aaron Mishkin · Issam Laradji · Mark Schmidt · Gauthier Gidel · Simon Lacoste-Julien
- 2019 Poster: Limitations of the empirical Fisher approximation for natural gradient descent
  Frederik Kunstner · Philipp Hennig · Lukas Balles
- 2018 Poster: SLANG: Fast Structured Covariance Approximations for Bayesian Deep Learning with Natural Gradient
  Aaron Mishkin · Frederik Kunstner · Didrik Nielsen · Mark Schmidt · Mohammad Emtiyaz Khan
- 2016: Fast Patch-based Style Transfer of Arbitrary Style
  Tian Qi Chen · Mark Schmidt
- 2015 Poster: StopWasting My Gradients: Practical SVRG
  Reza Babanezhad Harikandeh · Mohamed Osama Ahmed · Alim Virani · Mark Schmidt · Jakub Konečný · Scott Sallinen