Poster
Why are Adaptive Methods Good for Attention Models?
Jingzhao Zhang · Sai Praneeth Karimireddy · Andreas Veit · Seungyeon Kim · Sashank Reddi · Sanjiv Kumar · Suvrit Sra

Wed Dec 09 09:00 AM -- 11:00 AM (PST) @ Poster Session 3 #761

While stochastic gradient descent (SGD) is still the de facto algorithm in deep learning, adaptive methods like Clipped SGD/Adam have been observed to outperform SGD across important tasks, such as attention models. The settings under which SGD performs poorly in comparison to adaptive methods are not well understood yet. In this paper, we provide empirical and theoretical evidence that a heavy-tailed distribution of the noise in stochastic gradients is one cause of SGD's poor performance. We provide the first tight upper and lower convergence bounds for adaptive gradient methods under heavy-tailed noise. Further, we demonstrate how gradient clipping plays a key role in addressing heavy-tailed gradient noise. Subsequently, we show how clipping can be applied in practice by developing an adaptive coordinate-wise clipping algorithm (ACClip) and demonstrate its superior performance on BERT pretraining and finetuning tasks.
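As a rough illustration of the gradient clipping the abstract refers to, the following minimal sketch contrasts global-norm clipping with coordinate-wise clipping. It is not the paper's ACClip algorithm: the function names are hypothetical, and the thresholds here are fixed inputs chosen for illustration, whereas an adaptive method would update them during training.

```python
import numpy as np

def clip_by_global_norm(grad, threshold):
    """Rescale grad so that its L2 norm is at most `threshold`."""
    norm = np.linalg.norm(grad)
    if norm <= threshold:
        return grad.copy()
    return grad * (threshold / norm)

def clip_coordinate_wise(grad, thresholds):
    """Clip each coordinate's magnitude to its own (non-negative) threshold.

    `thresholds` is a fixed array here (an assumption for this sketch).
    """
    return np.clip(grad, -thresholds, thresholds)

def clipped_sgd_step(params, grad, lr, threshold):
    """One SGD step using a globally clipped stochastic gradient."""
    return params - lr * clip_by_global_norm(grad, threshold)

# A single heavy-tailed sample can produce a very large gradient;
# clipping bounds its influence on the update.
grad = np.array([3.0, 4.0])                      # L2 norm is 5
clipped = clip_by_global_norm(grad, 1.0)         # rescaled to norm 1
per_coord = clip_coordinate_wise(
    np.array([3.0, -4.0, 0.5]), np.array([1.0, 2.0, 1.0])
)
new_params = clipped_sgd_step(np.zeros(2), grad, lr=0.1, threshold=1.0)
```

Global-norm clipping preserves the gradient's direction, while coordinate-wise clipping can change it but bounds every coordinate independently.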

Author Information

Jingzhao Zhang (MIT)
Sai Praneeth Karimireddy (EPFL)

I am a second-year PhD student working in convex and non-convex optimization with Prof. Martin Jaggi. My focus is on designing faster and more scalable optimization algorithms for machine learning. Some of my preliminary results and the problems I am currently working on: 1. Robust accelerated algorithms: Nesterov acceleration modified to be robust to noise. 2. Faster algorithms that take second-order information about the function into account. 3. An $O(1/t^2)$ rate *affine invariant* algorithm for constrained optimization. 4. A Frank-Wolfe algorithm for non-smooth functions using 'noisy-smoothing'.

Andreas Veit (Google)
Seungyeon Kim (Google Research)
Sashank Reddi (Google)
Sanjiv Kumar (Google Research)
Suvrit Sra (MIT)

Suvrit Sra is a faculty member within the EECS department at MIT, where he is also a core faculty member of IDSS, LIDS, MIT-ML Group, as well as the statistics and data science center. His research spans topics in optimization, matrix theory, differential geometry, and probability theory, which he connects with machine learning; a key focus of his research is the theme "Optimization for Machine Learning" (http://opt-ml.org).
