Stochastic gradient descent (SGD) is a pillar of modern machine learning, serving as the go-to optimization algorithm for a diverse array of problems. While the empirical success of SGD is often attributed to its computational efficiency and favorable generalization behavior, neither effect is well understood and disentangling them remains an open problem. Even in the simple setting of convex quadratic problems, worst-case analyses give an asymptotic convergence rate for SGD that is no better than full-batch gradient descent (GD), and the purported implicit regularization effects of SGD lack a precise explanation. In this work, we study the dynamics of multi-pass SGD on high-dimensional convex quadratics and establish an asymptotic equivalence to a stochastic differential equation, which we call homogenized stochastic gradient descent (HSGD), whose solutions we characterize explicitly in terms of a Volterra integral equation. These results yield precise formulas for the learning and risk trajectories, which reveal a mechanism of implicit conditioning that explains the efficiency of SGD relative to GD. We also prove that the noise from SGD negatively impacts generalization performance, ruling out the possibility of any type of implicit regularization in this context. Finally, we show how to adapt the HSGD formalism to include streaming SGD, which allows us to produce an exact prediction for the excess risk of multi-pass SGD relative to that of streaming SGD (bootstrap risk).
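The setting described in the abstract, multi-pass SGD versus full-batch GD on a convex quadratic, can be illustrated with a small least-squares experiment. The following is a minimal sketch only: the Gaussian data model, problem sizes, step sizes, and all function names are illustrative assumptions that do not come from the paper, and the code does not implement HSGD or the Volterra equation. It simply records the empirical risk trajectory of each method once per epoch.

```python
import numpy as np


def make_least_squares(n=200, d=100, noise=0.1, seed=0):
    """Random least-squares problem (assumed Gaussian data, for illustration)."""
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((n, d)) / np.sqrt(d)  # rows have squared norm near 1
    x_true = rng.standard_normal(d)
    b = A @ x_true + noise * rng.standard_normal(n)
    return A, b


def risk(A, b, x):
    """Empirical risk f(x) = (1 / 2n) * ||A x - b||^2."""
    return 0.5 * np.mean((A @ x - b) ** 2)


def run_gd(A, b, epochs):
    """Full-batch gradient descent with step size 1/L (L = top Hessian eigenvalue)."""
    n, d = A.shape
    L = np.linalg.eigvalsh(A.T @ A / n)[-1]  # eigenvalues are returned in ascending order
    x = np.zeros(d)
    risks = [risk(A, b, x)]
    for _ in range(epochs):
        x -= (1.0 / L) * (A.T @ (A @ x - b)) / n
        risks.append(risk(A, b, x))
    return np.array(risks)


def run_multipass_sgd(A, b, epochs, lr=0.5, seed=1):
    """Single-sample SGD that revisits a fixed dataset (sampling with replacement)."""
    n, d = A.shape
    rng = np.random.default_rng(seed)
    x = np.zeros(d)
    risks = [risk(A, b, x)]
    for _ in range(epochs):
        # One "epoch" here means n single-sample steps, matching the cost of one GD step.
        for i in rng.integers(0, n, size=n):
            x -= lr * (A[i] @ x - b[i]) * A[i]
        risks.append(risk(A, b, x))
    return np.array(risks)


A, b = make_least_squares()
gd_risks = run_gd(A, b, epochs=20)
sgd_risks = run_multipass_sgd(A, b, epochs=20)
print("epoch  GD risk      SGD risk")
for k in range(0, 21, 5):
    print(f"{k:5d}  {gd_risks[k]:.6f}  {sgd_risks[k]:.6f}")
```

Both trajectories are indexed by epoch, so each point reflects the same number of per-sample gradient evaluations. The two step sizes are chosen independently and only for stability, so the printed numbers should be read as a toy comparison, not as a reproduction of the paper's results.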
Author Information
Courtney Paquette (McGill University)
Elliot Paquette (McGill University)
Ben Adlam (Google)
Jeffrey Pennington (Google Brain)
More from the Same Authors
- 2022: A Second-order Regression Model Shows Edge of Stability Behavior
  Fabian Pedregosa · Atish Agarwala · Jeffrey Pennington
- 2023 Workshop: OPT 2023: Optimization for Machine Learning
  Cristóbal Guzmán · Courtney Paquette · Katya Scheinberg · Aaron Sidford · Sebastian Stich
- 2022 Workshop: OPT 2022: Optimization for Machine Learning
  Courtney Paquette · Sebastian Stich · Quanquan Gu · Cristóbal Guzmán · John Duchi
- 2022 Poster: Precise Learning Curves and Higher-Order Scalings for Dot-product Kernel Regression
  Lechao Xiao · Hong Hu · Theodor Misiakiewicz · Yue Lu · Jeffrey Pennington
- 2022 Poster: Trajectory of Mini-Batch Momentum: Batch Size Saturation and Convergence in High Dimensions
  Kiwon Lee · Andrew Cheng · Elliot Paquette · Courtney Paquette
- 2021 Poster: Overparameterization Improves Robustness to Covariate Shift in High Dimensions
  Nilesh Tripuraneni · Ben Adlam · Jeffrey Pennington
- 2021 Poster: Dynamics of Stochastic Momentum Methods on Large-scale, Quadratic Models
  Courtney Paquette · Elliot Paquette
- 2020 Workshop: OPT2020: Optimization for Machine Learning
  Courtney Paquette · Mark Schmidt · Sebastian Stich · Quanquan Gu · Martin Takac
- 2020 Poster: Finite Versus Infinite Neural Networks: an Empirical Study
  Jaehoon Lee · Samuel Schoenholz · Jeffrey Pennington · Ben Adlam · Lechao Xiao · Roman Novak · Jascha Sohl-Dickstein
- 2020 Spotlight: Finite Versus Infinite Neural Networks: an Empirical Study
  Jaehoon Lee · Samuel Schoenholz · Jeffrey Pennington · Ben Adlam · Lechao Xiao · Roman Novak · Jascha Sohl-Dickstein
- 2020 Poster: The Surprising Simplicity of the Early-Time Learning Dynamics of Neural Networks
  Wei Hu · Lechao Xiao · Ben Adlam · Jeffrey Pennington
- 2020 Spotlight: The Surprising Simplicity of the Early-Time Learning Dynamics of Neural Networks
  Wei Hu · Lechao Xiao · Ben Adlam · Jeffrey Pennington
- 2020 Poster: Understanding Double Descent Requires A Fine-Grained Bias-Variance Decomposition
  Ben Adlam · Jeffrey Pennington
- 2019 Poster: Wide Neural Networks of Any Depth Evolve as Linear Models Under Gradient Descent
  Jaehoon Lee · Lechao Xiao · Samuel Schoenholz · Yasaman Bahri · Roman Novak · Jascha Sohl-Dickstein · Jeffrey Pennington
- 2018 Poster: The Spectrum of the Fisher Information Matrix of a Single-Hidden-Layer Neural Network
  Jeffrey Pennington · Pratik Worah
- 2017 Spotlight: Nonlinear random matrix theory for deep learning
  Jeffrey Pennington · Pratik Worah
- 2017 Poster: Nonlinear random matrix theory for deep learning
  Jeffrey Pennington · Pratik Worah
- 2017 Poster: Resurrecting the sigmoid in deep learning through dynamical isometry: theory and practice
  Jeffrey Pennington · Samuel Schoenholz · Surya Ganguli
- 2015 Poster: Spherical Random Features for Polynomial Kernels
  Jeffrey Pennington · Felix Yu · Sanjiv Kumar
- 2015 Spotlight: Spherical Random Features for Polynomial Kernels
  Jeffrey Pennington · Felix Yu · Sanjiv Kumar