Skip to yearly menu bar Skip to main content

Workshop: OPT 2023: Optimization for Machine Learning

Why Adam Outperforms Gradient Descent on Language Models: A Heavy-Tailed Class Imbalance Problem

Robin Yadav · Frederik Kunstner · Mark Schmidt · Alberto Bietti


We show that the heavy-tailed class imbalance found in language modeling tasks leads to difficul-ties in optimization dynamics. When training with gradient descent, the loss associated with lowfrequency classes decreases slower than the loss associated with high frequency classes. Under theheavy-tailed class imbalance found in language modeling tasks, most samples are from classes oflow relative frequency, leading to overall slow decreasing on the average loss. Sign-based optimizerssuch as Adam and sign descent do not suffer from this problem, and lead to decrease on all classes.We give evidence of this behavior on training for a 2-layer transformer on language data, a linearmodel on synthetic data whose only property is a heavy-tailed class distribution, and a convolutionalnetwork on a modified MNIST dataset made to exhibit heavy-tailed class imbalance.

Chat is not available.