

Poster

Why Transformers Need Adam: A Hessian Perspective

Yushun Zhang · Congliang Chen · Tian Ding · Ziniu Li · Ruoyu Sun · Zhi-Quan Luo

East Exhibit Hall A-C #4803
Thu 12 Dec 11 a.m. PST — 2 p.m. PST

Abstract:

SGD performs worse than Adam by a significant margin on Transformers, but the reason remains unclear. In this work, we provide an explanation of SGD's failure on Transformers through the lens of the Hessian: (i) Transformers are "heterogeneous": the Hessian spectrum varies dramatically across parameter blocks, a phenomenon we call "block heterogeneity"; (ii) heterogeneity hampers SGD: SGD performs badly on problems with block heterogeneity. To validate that heterogeneity hampers SGD, we check various Transformers, CNNs, MLPs, and quadratic problems, and find that SGD works well on problems without block heterogeneity but performs badly when the heterogeneity exists. Our initial theoretical analysis indicates that SGD fails because it applies a single learning rate to all blocks, which cannot handle the heterogeneity among blocks. The failure could be remedied by using coordinate-wise learning rates, as designed in Adam.
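
The effect of a single learning rate on a block-heterogeneous problem can be illustrated with a toy example. The sketch below is not the authors' experiment: it assumes a diagonal quadratic with two hypothetical parameter blocks whose curvatures differ by a factor of 100, and it compares plain (deterministic) gradient descent using one global learning rate against gradient descent using one learning rate per block. The per-block variant is a simplified stand-in for the coordinate-wise learning rates mentioned above, not an implementation of Adam. All names and values are illustrative assumptions.

```python
# Toy problem (assumed for illustration): f(x) = 0.5 * sum_i h_i * x_i^2,
# where the Hessian is diagonal and its entries h_i form two blocks
# with very different spectra.

def quadratic_loss(x, curvatures):
    """Loss of the diagonal quadratic at point x."""
    return 0.5 * sum(h * xi * xi for h, xi in zip(curvatures, x))


def gradient_descent(curvatures, learning_rates, steps):
    """Run gradient descent from x = (1, ..., 1) and return the final loss.

    learning_rates holds one step size per coordinate, so the same routine
    covers both the single-rate and the per-block setting.
    """
    x = [1.0] * len(curvatures)
    for _ in range(steps):
        # The gradient of the quadratic in coordinate i is h_i * x_i.
        x = [xi - lr * h * xi for xi, lr, h in zip(x, learning_rates, curvatures)]
    return quadratic_loss(x, curvatures)


# Two hypothetical blocks: same internal condition number (2),
# but block_b has 100x larger curvature than block_a.
block_a = [1.0, 2.0]
block_b = [100.0, 200.0]
curvatures = block_a + block_b

# Single learning rate: the classical best constant step for the full
# spectrum, 2 / (h_min + h_max). It is capped by the largest curvature.
single_lr = 2.0 / (min(curvatures) + max(curvatures))
loss_single = gradient_descent(curvatures, [single_lr] * len(curvatures), steps=100)

# Per-block learning rates: the same formula applied to each block's spectrum.
lr_a = 2.0 / (min(block_a) + max(block_a))
lr_b = 2.0 / (min(block_b) + max(block_b))
blockwise_lrs = [lr_a] * len(block_a) + [lr_b] * len(block_b)
loss_blockwise = gradient_descent(curvatures, blockwise_lrs, steps=100)

print("final loss, single learning rate:    ", loss_single)
print("final loss, per-block learning rates:", loss_blockwise)
```

In this toy setting the global step size is limited by the largest curvature, so the low-curvature block barely moves, while per-block step sizes let each block contract at the rate set by its own, much smaller, condition number.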
