Training deep networks on increasingly large-scale datasets is computationally challenging. In this work, we explore the problem of "how to accelerate the convergence of adaptive gradient algorithms in a general manner" and aim to provide practical insights for boosting training efficiency. To this end, we propose an effective Weight-decay-Integrated Nesterov acceleration (Win) for adaptive algorithms to enhance their convergence speed. Taking AdamW and Adam as examples, at each iteration we minimize a dynamic loss that combines the vanilla training loss with a dynamic regularizer inspired by the proximal point method (PPM) to improve the convexity of the problem. Then, while fixing the dynamic regularization brought by PPM, we use the first- and second-order Taylor approximations of the vanilla loss to update the variable twice. In this way, we arrive at our Win acceleration (similar to Nesterov acceleration) for AdamW and Adam, which performs a conservative step and a reckless step and then linearly combines the two updates for acceleration. Next, we extend this Win acceleration to LAMB and SGD. Our transparent derivation of the acceleration could provide insights for other accelerated methods and their integration into adaptive algorithms. Moreover, we prove the convergence of Win-accelerated adaptive algorithms and justify their convergence superiority over their non-accelerated counterparts, taking AdamW and Adam as examples. Experimental results demonstrate the faster convergence and superior performance of our Win-accelerated AdamW, Adam, LAMB and SGD over their vanilla counterparts on vision classification and language modeling tasks with CNN and Transformer backbones.
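The abstract describes the update structure only at a high level: an AdamW-style "conservative" step, a larger-stepsize "reckless" step, and a linear combination of the two. Below is a minimal NumPy sketch of that structure, not the paper's exact algorithm: the function name win_adamw_step, the reckless-step scale gamma, and the mixing weight tau are illustrative assumptions, and the authors' constants and precise formulas may differ.

```python
import numpy as np

def win_adamw_step(x, y, grad, m, v, t,
                   lr=1e-3, beta1=0.9, beta2=0.999,
                   eps=1e-8, weight_decay=1e-2, gamma=2.0):
    """One hypothetical Win-style update following the abstract's recipe.

    x: conservative iterate; y: auxiliary (combined) iterate;
    grad: stochastic gradient evaluated at y; m, v: Adam moments;
    t: 1-indexed step count. gamma and tau below are illustrative."""
    # Standard Adam moment estimates with bias correction.
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    update = m_hat / (np.sqrt(v_hat) + eps)

    # Conservative step: small stepsize with decoupled (AdamW-style)
    # weight decay, written here in its implicit/proximal form.
    x_new = (x - lr * update) / (1 + lr * weight_decay)
    # Reckless step: same direction, larger stepsize gamma * lr.
    y_new = (y - gamma * lr * update) / (1 + gamma * lr * weight_decay)
    # Linearly combine the two updates for acceleration.
    tau = 1.0 / (1.0 + gamma)  # illustrative mixing weight
    y_new = tau * x_new + (1 - tau) * y_new

    return x_new, y_new, m, v

# Toy usage on f(w) = 0.5 * ||w||^2, whose gradient at y is simply y.
x = y = np.ones(3)
m = v = np.zeros(3)
for t in range(1, 101):
    x, y, m, v = win_adamw_step(x, y, y, m, v, t)
```

The key design point the sketch illustrates is that weight decay enters both steps through the proximal-style division rather than being folded into the gradient, matching the decoupled-weight-decay view the abstract builds on.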
Author Information
Pan Zhou (SEA AI Lab)
Currently, I am a Senior Research Scientist at Sea AI Lab of Sea Group. Before that, I worked at Salesforce as a research scientist from 2019 to 2021. I completed my Ph.D. in 2019 at the National University of Singapore (NUS), where I was fortunate to be advised by Prof. Jiashi Feng and Prof. Shuicheng Yan. Before studying at NUS, I graduated from Peking University (PKU) in 2016; during that period, I was fortunate to be directed by Prof. Zhouchen Lin and Prof. Chao Zhang in the ZERO Lab, and I also worked closely with Prof. Xiaotong Yuan. In addition, I spent several wonderful months in 2018 at Georgia Tech as a visiting student hosted by Prof. Huan Xu.
Xingyu Xie (Peking University)
Shuicheng Yan (Sea AI Lab)
More from the Same Authors
- 2020 : Task Similarity Aware Meta Learning: Theory-inspired Improvement on MAML »
  Pan Zhou
- 2021 Spotlight: A Theory-Driven Self-Labeling Refinement Method for Contrastive Representation Learning »
  Pan Zhou · Caiming Xiong · Xiaotong Yuan · Steven Chu Hong Hoi
- 2022 Poster: Inception Transformer »
  Chenyang Si · Weihao Yu · Pan Zhou · Yichen Zhou · Xinchao Wang · Shuicheng Yan
- 2022 : Adan: Adaptive Nesterov Momentum Algorithm for Faster Optimizing Deep Models »
  Xingyu Xie · Pan Zhou · Huan Li · Zhouchen Lin · Shuicheng Yan
- 2022 : Dimension-Reduced Adaptive Gradient Method »
  Jingyang Li · Pan Zhou · Kuangyu Ding · Kim-Chuan Toh · Yinyu Ye
- 2022 : Boosting Offline Reinforcement Learning via Data Resampling »
  Yang Yue · Bingyi Kang · Xiao Ma · Zhongwen Xu · Gao Huang · Shuicheng Yan
- 2022 : Mutual Information Regularized Offline Reinforcement Learning »
  Xiao Ma · Bingyi Kang · Zhongwen Xu · Min Lin · Shuicheng Yan
- 2022 : HloEnv: A Graph Rewrite Environment for Deep Learning Compiler Optimization Research »
  Chin Yang Oh · Kunhao Zheng · Bingyi Kang · Xinyi Wan · Zhongwen Xu · Shuicheng Yan · Min Lin · Yangzihao Wang
- 2022 : Efficient Offline Policy Optimization with a Learned Model »
  Zichen Liu · Siyi Li · Wee Sun Lee · Shuicheng Yan · Zhongwen Xu
- 2022 : Visual Imitation Learning with Patch Rewards »
  Minghuan Liu · Tairan He · Weinan Zhang · Shuicheng Yan · Zhongwen Xu
- 2023 Poster: Task-Robust Pre-Training for Worst-Case Downstream Adaptation »
  Jianghui Wang · Yang Chen · Xingyu Xie · Cong Fang · Zhouchen Lin
- 2023 Poster: ScaleLong: Towards More Stable Training of Diffusion Model via Scaling Network Long Skip Connection »
  Zhongzhan Huang · Pan Zhou · Shuicheng Yan · Liang Lin
- 2022 Spotlight: Inception Transformer »
  Chenyang Si · Weihao Yu · Pan Zhou · Yichen Zhou · Xinchao Wang · Shuicheng Yan
- 2022 Spotlight: Lightning Talks 2B-1 »
  Yehui Tang · Jian Wang · Zheng Chen · man zhou · Peng Gao · Chenyang Si · SHANGKUN SUN · Yixing Xu · Weihao Yu · Xinghao Chen · Kai Han · Hu Yu · Yulun Zhang · Chenhui Gou · Teli Ma · Yuanqi Chen · Yunhe Wang · Hongsheng Li · Jinjin Gu · Jianyuan Guo · Qiman Wu · Pan Zhou · Yu Zhu · Jie Huang · Chang Xu · Yichen Zhou · Haocheng Feng · Guodong Guo · yongbing zhang · Ziyi Lin · Feng Zhao · Ge Li · Junyu Han · Jinwei Gu · Jifeng Dai · Chao Xu · Xinchao Wang · Linghe Kong · Shuicheng Yan · Yu Qiao · Chen Change Loy · Xin Yuan · Errui Ding · Yunhe Wang · Deyu Meng · Jingdong Wang · Chongyi Li
- 2022 Poster: EnvPool: A Highly Parallel Reinforcement Learning Environment Execution Engine »
  Jiayi Weng · Min Lin · Shengyi Huang · Bo Liu · Denys Makoviichuk · Viktor Makoviychuk · Zichen Liu · Yufan Song · Ting Luo · Yukun Jiang · Zhongwen Xu · Shuicheng Yan
- 2021 Poster: Towards Understanding Why Lookahead Generalizes Better Than SGD and Beyond »
  Pan Zhou · Hanshu Yan · Xiaotong Yuan · Jiashi Feng · Shuicheng Yan
- 2021 Poster: A Theory-Driven Self-Labeling Refinement Method for Contrastive Representation Learning »
  Pan Zhou · Caiming Xiong · Xiaotong Yuan · Steven Chu Hong Hoi
- 2020 Poster: Towards Theoretically Understanding Why Sgd Generalizes Better Than Adam in Deep Learning »
  Pan Zhou · Jiashi Feng · Chao Ma · Caiming Xiong · Steven Chu Hong Hoi · Weinan E
- 2020 Poster: Theory-Inspired Path-Regularized Differential Network Architecture Search »
  Pan Zhou · Caiming Xiong · Richard Socher · Steven Chu Hong Hoi
- 2020 Oral: Theory-Inspired Path-Regularized Differential Network Architecture Search »
  Pan Zhou · Caiming Xiong · Richard Socher · Steven Chu Hong Hoi
- 2020 Poster: Improving GAN Training with Probability Ratio Clipping and Sample Reweighting »
  Yue Wu · Pan Zhou · Andrew Wilson · Eric Xing · Zhiting Hu
- 2019 Poster: Efficient Meta Learning via Minibatch Proximal Update »
  Pan Zhou · Xiaotong Yuan · Huan Xu · Shuicheng Yan · Jiashi Feng
- 2019 Spotlight: Efficient Meta Learning via Minibatch Proximal Update »
  Pan Zhou · Xiaotong Yuan · Huan Xu · Shuicheng Yan · Jiashi Feng
- 2018 Poster: New Insight into Hybrid Stochastic Gradient Descent: Beyond With-Replacement Sampling and Convexity »
  Pan Zhou · Xiaotong Yuan · Jiashi Feng
- 2018 Poster: Efficient Stochastic Gradient Hard Thresholding »
  Pan Zhou · Xiaotong Yuan · Jiashi Feng