Timezone: »

Xingyu Xie · Pan Zhou · Huan Li · Zhouchen Lin · Shuicheng Yan

Adaptive gradient algorithms combine the moving average idea with heavy ball acceleration to estimate accurate first- and second-order moments of the gradient for accelerating convergence. But Nesterov acceleration which converges faster than heavy ball acceleration in theory and also in many empirical cases, is much less investigated under the adaptive gradient setting. In this work, we propose the ADAptive Nesterov momentum algorithm (Adan) to speed up the training of deep neural networks. Adan first reformulates the vanilla Nesterov acceleration to develop a new Nesterov momentum estimation (NME) method that avoids the extra computation and memory overhead of computing gradient at the extrapolation point. Then Adan adopts NME to estimate the first- and second-order gradient moments in adaptive gradient algorithms for convergence acceleration. Besides, we prove that Adan finds an $\epsilon$-approximate stationary point within $\order{\epsilon^{-4}}$ stochastic gradient complexity on the non-convex stochastic problems, matching the best-known lower bound. Extensive experimental results show that Adan surpasses the corresponding SoTA optimizers for vision, language, and RL tasks and sets new SoTAs for many popular networks and frameworks, e.g., ResNet, ConvNext, ViT, Swin, MAE, Transformer-XL, and BERT. More surprisingly, Adan can use half of the training cost (epochs) of SoTA optimizers to achieve higher or comparable performance on ViT, ResNet, MAE, e.t.c, and also shows great tolerance to a large range of minibatch size, e.g., from 1k to 32k.

#### Author Information

##### Pan Zhou (SEA AI Lab)

Currently, I am a senior Research Scientist in Sea AI Lab of Sea group. Before, I worked in Salesforce as a research scientist during 2019 to 2021. I completed my Ph.D. degree in 2019 at the National University of Singapore (NUS), fortunately advised by Prof. Jiashi Feng and Prof. Shuicheng Yan. Before studying in NUS, I graduated from Peking University (PKU) in 2016 and during this period, I was fortunately directed by Prof. Zhouchen Lin and Prof. Chao Zhang in ZERO Lab. During the research period, I also work closely with Prof. Xiaotong Yuan. I also spend several wonderful months in 2018 at Georgia Tech as visiting student hosted by Prof. Huan Xu.