Parameter Free Dual Averaging: Optimizing Lipschitz Functions in a Single Pass
Aaron Defazio · Konstantin Mishchenko
Event URL: https://openreview.net/forum?id=vtv83s2Ps9
Both gradient descent and dual averaging for convex Lipschitz functions have convergence rates that are highly dependent on the choice of learning rate. Even when the Lipschitz constant is known, setting the learning rate to achieve the optimal convergence rate requires knowing a bound on the distance from the initial point to the solution set $D$. A number of approaches are known that relax this requirement, but they either require line searches, restarting (hyper-parameter grid search), or do not derive from the gradient descent or dual averaging frameworks (coin-betting). In this work we describe a single pass method, with no back-tracking or line searches, derived from dual averaging, which does not require knowledge of $D$ yet asymptotically achieves the optimal rate of convergence for the complexity class of convex Lipschitz functions.
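As a point of reference for the abstract above, the sketch below shows classical dual averaging for a convex $G$-Lipschitz objective and where the distance bound $D$ enters the step-size schedule. This is an illustration of the baseline framework only, not the paper's parameter-free algorithm; the toy objective, the subgradient oracle `f_subgrad`, and the constants `G` and `D` are assumptions made for the example.

```python
# Minimal sketch (illustrative, not the paper's method): classical dual averaging
# for a convex, G-Lipschitz objective. The step scale gamma_k = D / (G * sqrt(k))
# is what requires a bound D on the distance from x0 to the solution set.
import numpy as np


def dual_averaging(x0, f_subgrad, G, D, n_steps):
    """Plain dual averaging: each iterate is x0 minus a decaying step scale
    times the sum of all subgradients seen so far.

    Returns the running average of the iterates, which enjoys the
    O(G * D / sqrt(n)) guarantee when D and G are known.
    """
    x = x0.copy()
    s = np.zeros_like(x0)       # running sum of subgradients
    x_avg = np.zeros_like(x0)   # running average of iterates
    for k in range(1, n_steps + 1):
        g = f_subgrad(x)                # any subgradient of f at the current point
        s = s + g
        gamma = D / (G * np.sqrt(k))    # the step scale that needs D
        x = x0 - gamma * s
        x_avg += (x - x_avg) / k
    return x_avg


if __name__ == "__main__":
    # Toy problem: f(x) = ||x||_1, minimized at 0; np.sign is a valid subgradient.
    dim = 10
    x0 = np.ones(dim)
    x_hat = dual_averaging(x0, np.sign, G=np.sqrt(dim), D=np.sqrt(dim), n_steps=10_000)
    print("f(x_hat) =", np.abs(x_hat).sum())
```

With the scaling $\gamma_k = D/(G\sqrt{k})$, the averaged iterate attains the optimal $O(GD/\sqrt{n})$ rate for convex Lipschitz functions; the paper's contribution is a dual-averaging variant that matches this rate asymptotically without being given $D$.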
Author Information
Aaron Defazio (Facebook AI Research)
Konstantin Mishchenko (CNRS)
More from the Same Authors
- 2021: On Server-Side Stepsizes in Federated Optimization: Theory Explaining the Heuristics
  Grigory Malinovsky · Konstantin Mishchenko · Peter Richtarik
- 2022: Asynchronous Optimization: Delays, Stability, and the Impact of Data Heterogeneity
  Konstantin Mishchenko
- 2022 Poster: Asynchronous SGD Beats Minibatch SGD Under Arbitrary Delays
  Konstantin Mishchenko · Francis Bach · Mathieu Even · Blake Woodworth
- 2020 Poster: Random Reshuffling: Simple Analysis with Vast Improvements
  Konstantin Mishchenko · Ahmed Khaled Ragab Bayoumi · Peter Richtarik
- 2020 Poster: MRI Banding Removal via Adversarial Training
  Aaron Defazio · Tullie Murrell · Michael Recht
- 2019: Spotlight talks
  Damien Scieur · Konstantin Mishchenko · Rohan Anil
- 2019: Poster Session
  Eduard Gorbunov · Alexandre d'Aspremont · Lingxiao Wang · Liwei Wang · Boris Ginsburg · Alessio Quaglino · Camille Castera · Saurabh Adya · Diego Granziol · Rudrajit Das · Raghu Bollapragada · Fabian Pedregosa · Martin Takac · Majid Jahani · Sai Praneeth Karimireddy · Hilal Asi · Balint Daroczy · Leonard Adolphs · Aditya Rawal · Nicolas Brandt · Minhan Li · Giuseppe Ughi · Orlando Romero · Ivan Skorokhodov · Damien Scieur · Kiwook Bae · Konstantin Mishchenko · Rohan Anil · Vatsal Sharan · Aditya Balu · Chao Chen · Zhewei Yao · Tolga Ergen · Paul Grigas · Chris Junchi Li · Jimmy Ba · Stephen J Roberts · Sharan Vaswani · Armin Eftekhari · Chhavi Sharma
- 2019 Poster: On the Ineffectiveness of Variance Reduced Optimization for Deep Learning
  Aaron Defazio · Leon Bottou
- 2019 Poster: On the Curved Geometry of Accelerated Optimization
  Aaron Defazio
- 2018 Poster: SEGA: Variance Reduction via Gradient Sketching
  Filip Hanzely · Konstantin Mishchenko · Peter Richtarik