Poster

Momentum-Based Variance Reduction in Non-Convex SGD

Ashok Cutkosky · Francesco Orabona

East Exhibition Hall B, C #214

Keywords: [ Non-Convex Optimization ] [ Optimization ] [ Stochastic Optimization ]


Abstract: Variance reduction has emerged in recent years as a strong competitor to stochastic gradient descent in non-convex problems, providing the first algorithms to improve upon the convergence rate of stochastic gradient descent for finding first-order critical points. However, variance reduction techniques typically require carefully tuned learning rates and willingness to use excessively large "mega-batches" in order to achieve their improved results. We present a new algorithm, STORM, that does not require any batches and makes use of adaptive learning rates, enabling simpler implementation and less hyperparameter tuning. Our technique for removing the batches uses a variant of momentum to achieve variance reduction in non-convex optimization. On smooth losses $F$, STORM finds a point $x$ with $\mathbb{E}[\|\nabla F(x)\|] \le O(1/\sqrt{T} + \sigma^{1/3}/T^{1/3})$ in $T$ iterations with $\sigma^2$ variance in the gradients, matching the best-known rate but without requiring knowledge of $\sigma$.
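For intuition, below is a minimal Python sketch of the recursive-momentum update the abstract describes: each iteration draws one fresh sample, evaluates the stochastic gradient at both the current and the previous iterate on that same sample, and folds the difference into a momentum estimate whose weight is tied to an adaptive learning rate. The interface (grad_fn, sample_fn), the default constants k, w, c, and the toy quadratic at the end are illustrative assumptions for this sketch, not the authors' reference implementation.

import numpy as np

def storm(grad_fn, sample_fn, x0, T, k=0.1, w=1.0, c=10.0):
    """Illustrative STORM-style sketch (assumed interface, not the paper's code).

    grad_fn(x, xi): unbiased stochastic gradient of F at x on sample xi.
    sample_fn():    draws one fresh random sample xi per iteration.
    k, w, c:        hyperparameters for the adaptive step size and momentum.
    """
    x = np.asarray(x0, dtype=float)
    xi = sample_fn()
    d = grad_fn(x, xi)            # initial estimate: a plain stochastic gradient
    G2 = float(np.dot(d, d))      # running sum of squared gradient norms
    for _ in range(T):
        eta = k / (w + G2) ** (1.0 / 3.0)   # adaptive step size; no knowledge of sigma
        x_prev, x = x, x - eta * d          # descend along the momentum estimate
        xi = sample_fn()                    # ONE sample per step; no mega-batch
        g = grad_fn(x, xi)
        g_prev = grad_fn(x_prev, xi)        # same sample, evaluated at the previous iterate
        a = min(1.0, c * eta ** 2)          # momentum weight a_{t+1} = c * eta_t^2
        # Recursive variance-reduced momentum:
        #   d_{t+1} = grad(x_{t+1}) + (1 - a) * (d_t - grad(x_t))
        d = g + (1.0 - a) * (d - g_prev)
        G2 += float(np.dot(g, g))
    return x

# Toy check on a hypothetical problem: F(x) = 0.5 * ||x||^2 with Gaussian gradient noise.
rng = np.random.default_rng(0)
x_out = storm(grad_fn=lambda x, xi: x + xi,
              sample_fn=lambda: rng.normal(scale=0.1, size=3),
              x0=np.ones(3), T=2000)
print(x_out)  # expected to land near the origin

The same-sample evaluation at the current and previous iterates is what lets the (1 - a)(d - g_prev) correction cancel gradient noise, replacing the large batches that other variance-reduction methods rely on.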
