Heavy-tailed noise does not explain the gap between SGD and Adam on Transformers
Jacques Chen · Frederik Kunstner · Mark Schmidt

The success of Adam on deep learning architectures such as transformers has made it the default option in many applications where stochastic gradient descent (SGD) does not work. Our theoretical understanding of this discrepancy is lagging, which has prevented us from significantly improving either algorithm. Recent work advanced the hypothesis that Adam, and other heuristics like gradient clipping, outperform SGD on language tasks because the distribution of the stochastic gradients is more heavy-tailed. This suggests that Adam performs better because it is more robust to outliers, which is a promising avenue for designing better stochastic gradient estimators. However, it is unclear whether heavy-tailed noise is the cause of this discrepancy or another symptom. Through experiments that control for stochasticity by increasing the batch size, we present evidence that stochasticity and heavy-tailed noise are not major factors in this gap.

Author Information

Jacques Chen (University of British Columbia)
Frederik Kunstner (University of British Columbia)
Mark Schmidt (University of British Columbia)
