A Better Way to Decay: Proximal Gradient Training Algorithms for Neural Nets
Liu Yang · Jifan Zhang · Joseph Shenouda · Dimitris Papailiopoulos · Kangwook Lee · Robert Nowak
Event URL: https://openreview.net/forum?id=4y1xh8jClhC
Weight decay is one of the most widely used forms of regularization in deep learning, and has been shown to improve generalization and robustness. The optimization objective driving weight decay is a sum of losses plus a term proportional to the sum of squared weights. This paper argues that stochastic gradient descent (SGD) may be an inefficient algorithm for this objective. For neural networks with ReLU activations, solutions to the weight decay objective are equivalent to those of a different objective in which the regularization term is instead a sum of products of the $\ell_2$ (not squared) norms of the input and output weights associated with each ReLU. This alternative \emph{(and effectively equivalent)} regularization suggests a novel proximal gradient algorithm for network training. Theory and experiments support the new training approach, showing that it can converge much faster to the \emph{sparse} solutions it shares with standard weight decay training.
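To illustrate the kind of update the abstract describes, here is a minimal sketch of one proximal-gradient training step for a toy two-layer ReLU network in PyTorch: a plain gradient step on the data loss, followed by a group soft-threshold (the proximal operator of a per-neuron $\ell_2$ penalty) that can zero out entire neurons. The network shapes, the `prox_group_l2` helper, and the choice to scale each neuron's threshold by its input-weight norm are illustrative assumptions made for this sketch; the paper's exact proximal operator for the product-of-norms regularizer may differ.

```python
import torch

def prox_group_l2(w, thresh):
    """Group soft-thresholding: the proximal operator of thresh * ||row||_2,
    applied rowwise. Rows with norm below thresh become exactly zero."""
    norms = w.norm(dim=1, keepdim=True).clamp_min(1e-12)
    return w * (1.0 - thresh / norms).clamp_min(0.0)

# Toy two-layer ReLU network: x -> V relu(U x).
# Row k of U and column k of V are the input/output weights of ReLU k.
U = torch.randn(64, 10, requires_grad=True)
V = torch.randn(1, 64, requires_grad=True)
x, y = torch.randn(128, 10), torch.randn(128, 1)

lr, lam = 0.1, 1e-3  # step size and regularization strength (assumed values)

loss = ((torch.relu(x @ U.T) @ V.T - y) ** 2).mean()  # unregularized data loss
loss.backward()

with torch.no_grad():
    # 1) gradient step on the data loss only ...
    U -= lr * U.grad
    V -= lr * V.grad
    # 2) ... then a proximal step on the regularizer. Each neuron's output
    # weights are shrunk with a threshold weighted by the norm of its input
    # weights (a hypothetical coupling chosen for this sketch), so weak
    # neurons are driven exactly to zero rather than merely decayed.
    V.copy_(prox_group_l2(V.T, lr * lam * U.norm(dim=1, keepdim=True)).T)
    U.grad.zero_()
    V.grad.zero_()
```

Unlike SGD with weight decay, which shrinks every weight multiplicatively and so only approaches zero asymptotically, the soft-threshold step sets small neurons exactly to zero, which is how a proximal method can reach sparse solutions faster.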
Author Information
Liu Yang (University of Wisconsin, Madison)
Jifan Zhang (University of Wisconsin)
Joseph Shenouda (University of Wisconsin Madison)
Dimitris Papailiopoulos (University of Wisconsin-Madison)
Kangwook Lee (UW Madison, Krafton)
Robert Nowak (University of Wisconsin-Madison)
More from the Same Authors
- 2022 : Active Learning is a Strong Baseline for Data Subset Selection
  Dongmin Park · Dimitris Papailiopoulos · Kangwook Lee
- 2022 : Panel
  Mayee Chen · Alexander Ratner · Robert Nowak · Cody Coleman · Ramya Korlakai Vinayak
- 2022 : Poster Session 2
  Jinwuk Seok · Bo Liu · Ryotaro Mitsuboshi · David Martinez-Rubio · Weiqiang Zheng · Ilgee Hong · Chen Fan · Kazusato Oko · Bo Tang · Miao Cheng · Aaron Defazio · Tim G. J. Rudner · Gabriele Farina · Vishwak Srinivasan · Ruichen Jiang · Peng Wang · Jane Lee · Nathan Wycoff · Nikhil Ghosh · Yinbin Han · David Mueller · Liu Yang · Amrutha Varshini Ramesh · Siqi Zhang · Kaifeng Lyu · David Yunis · Kumar Kshitij Patel · Fangshuo Liao · Dmitrii Avdiukhin · Xiang Li · Sattar Vakili · Jiaxin Shi
- 2022 Poster: Efficient Active Learning with Abstention
  Yinglun Zhu · Robert Nowak
- 2022 Poster: Active Learning with Neural Networks: Insights from Nonparametric Statistics
  Yinglun Zhu · Robert Nowak
- 2022 Poster: LIFT: Language-Interfaced Fine-Tuning for Non-language Machine Learning Tasks
  Tuan Dinh · Yuchen Zeng · Ruisu Zhang · Ziqian Lin · Michael Gira · Shashank Rajput · Jy-yong Sohn · Dimitris Papailiopoulos · Kangwook Lee
- 2022 Poster: Score-based Generative Modeling Secretly Minimizes the Wasserstein Distance
  Dohyun Kwon · Ying Fan · Kangwook Lee
- 2022 Poster: One for All: Simultaneous Metric and Preference Learning over Multiple Users
  Gregory Canal · Blake Mason · Ramya Korlakai Vinayak · Robert Nowak
- 2022 Poster: Rare Gems: Finding Lottery Tickets at Initialization
  Kartik Sreenivasan · Jy-yong Sohn · Liu Yang · Matthew Grinde · Alliot Nagle · Hongyi Wang · Eric Xing · Kangwook Lee · Dimitris Papailiopoulos
- 2021 Poster: An Exponential Improvement on the Memorization Capacity of Deep Threshold Networks
  Shashank Rajput · Kartik Sreenivasan · Dimitris Papailiopoulos · Amin Karbasi
- 2021 Poster: Pure Exploration in Kernel and Neural Bandits
  Yinglun Zhu · Dongruo Zhou · Ruoxi Jiang · Quanquan Gu · Rebecca Willett · Robert Nowak
- 2020 : Dataset Curation via Active Learning
  Robert Nowak
- 2020 Poster: On Regret with Multiple Best Arms
  Yinglun Zhu · Robert Nowak
- 2020 Poster: Bad Global Minima Exist and SGD Can Reach Them
  Shengchao Liu · Dimitris Papailiopoulos · Dimitris Achlioptas
- 2020 Poster: Attack of the Tails: Yes, You Really Can Backdoor Federated Learning
  Hongyi Wang · Kartik Sreenivasan · Shashank Rajput · Harit Vishwakarma · Saurabh Agarwal · Jy-yong Sohn · Kangwook Lee · Dimitris Papailiopoulos
- 2020 Poster: Optimal Lottery Tickets via Subset Sum: Logarithmic Over-Parameterization is Sufficient
  Ankit Pensia · Shashank Rajput · Alliot Nagle · Harit Vishwakarma · Dimitris Papailiopoulos
- 2020 Poster: Finding All $\epsilon$-Good Arms in Stochastic Bandits
  Blake Mason · Lalit Jain · Ardhendu Tripathy · Robert Nowak
- 2020 Spotlight: Optimal Lottery Tickets via Subset Sum: Logarithmic Over-Parameterization is Sufficient
  Ankit Pensia · Shashank Rajput · Alliot Nagle · Harit Vishwakarma · Dimitris Papailiopoulos
- 2019 Poster: Learning Nearest Neighbor Graphs from Noisy Distance Samples
  Blake Mason · Ardhendu Tripathy · Robert Nowak
- 2019 Poster: MaxGap Bandit: Adaptive Algorithms for Approximate Ranking
  Sumeet Katariya · Ardhendu Tripathy · Robert Nowak
- 2019 Poster: DETOX: A Redundancy-based Framework for Faster and More Robust Gradient Aggregation
  Shashank Rajput · Hongyi Wang · Zachary Charles · Dimitris Papailiopoulos
- 2018 Poster: The Effect of Network Width on the Performance of Large-batch Training
  Lingjiao Chen · Hongyi Wang · Jinman Zhao · Dimitris Papailiopoulos · Paraschos Koutris
- 2018 Poster: ATOMO: Communication-efficient Learning via Atomic Sparsification
  Hongyi Wang · Scott Sievert · Shengchao Liu · Zachary Charles · Dimitris Papailiopoulos · Stephen Wright
- 2017 Poster: Scalable Generalized Linear Bandits: Online Computation and Hashing
  Kwang-Sung Jun · Aniruddha Bhargava · Robert Nowak · Rebecca Willett
- 2017 Poster: A KL-LUCB algorithm for Large-Scale Crowdsourcing
  Ervin Tanczos · Robert Nowak · Bob Mankoff
- 2017 Poster: Learning Low-Dimensional Metrics
  Blake Mason · Lalit Jain · Robert Nowak
- 2016 Poster: Cyclades: Conflict-free Asynchronous Machine Learning
  Xinghao Pan · Maximilian Lam · Stephen Tu · Dimitris Papailiopoulos · Ce Zhang · Michael Jordan · Kannan Ramchandran · Christopher Ré · Benjamin Recht
- 2015 Poster: Orthogonal NMF through Subspace Exploration
  Megasthenis Asteris · Dimitris Papailiopoulos · Alex Dimakis
- 2015 Poster: Sparse PCA via Bipartite Matchings
  Megasthenis Asteris · Dimitris Papailiopoulos · Anastasios Kyrillidis · Alex Dimakis
- 2015 Poster: Parallel Correlation Clustering on Big Graphs
  Xinghao Pan · Dimitris Papailiopoulos · Samet Oymak · Benjamin Recht · Kannan Ramchandran · Michael Jordan