Towards the end of training, stochastic first-order methods such as SGD and Adam enter a diffusion phase and no longer make significant progress. In the deterministic case, by contrast, Newton-type methods are highly efficient “close” to the optimum. These methods might therefore turn out to be a particularly efficient tool for the final phase of training in the stochastic deep learning context as well. In our work, we study this idea by conducting an empirical comparison of a second-order Hessian-free optimizer and different first-order strategies with learning rate decay for late-phase training. We show that performing a few costly but precise second-order steps can outperform first-order alternatives in terms of wall-clock runtime.
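To make the idea concrete, below is a minimal sketch of one damped Newton step computed Hessian-free, i.e. with conjugate gradient (CG) and Hessian-vector products instead of an explicit Hessian. This is illustrative only and not the authors' implementation; the function name `newton_cg_step` and the values of `cg_iters` and `damping` are placeholder assumptions for this example.

# Minimal sketch of one damped Newton / Hessian-free step via conjugate
# gradient (CG) with Hessian-vector products. Illustrative only -- not the
# authors' implementation; `newton_cg_step`, `cg_iters`, and `damping` are
# placeholder names/values chosen for this example.
import torch


def flatten(tensors):
    """Concatenate a list of tensors into a single flat vector."""
    return torch.cat([t.reshape(-1) for t in tensors])


def newton_cg_step(loss, params, cg_iters=10, damping=1e-3):
    """Approximately solve (H + damping * I) d = -g and apply d in place."""
    # Gradient with create_graph=True so Hessian-vector products are possible.
    flat_grad = flatten(torch.autograd.grad(loss, params, create_graph=True))
    g = flat_grad.detach()

    def hvp(vec):
        # Hessian-vector product via double backprop (Pearlmutter's trick),
        # with Tikhonov damping added.
        hv = torch.autograd.grad(flat_grad @ vec, params, retain_graph=True)
        return flatten(hv) + damping * vec

    # Conjugate gradient for (H + damping * I) d = -g, starting from d = 0.
    d = torch.zeros_like(g)
    r = -g.clone()      # residual
    p = r.clone()       # search direction
    rs_old = r @ r
    for _ in range(cg_iters):
        Hp = hvp(p)
        alpha = rs_old / (p @ Hp)
        d = d + alpha * p
        r = r - alpha * Hp
        rs_new = r @ r
        if rs_new.sqrt() < 1e-8:
            break
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new

    # Apply the damped Newton step to the parameters.
    offset = 0
    with torch.no_grad():
        for param in params:
            n = param.numel()
            param.add_(d[offset:offset + n].reshape(param.shape))
            offset += n


# Toy usage: a single "costly but precise" second-order step on a small model.
model = torch.nn.Linear(5, 1)
x, y = torch.randn(32, 5), torch.randn(32, 1)
loss = torch.nn.functional.mse_loss(model(x), y)
newton_cg_step(loss, list(model.parameters()), cg_iters=10, damping=1e-3)

Each such step is much more expensive than an SGD or Adam update (several Hessian-vector products per step), which is why the comparison in the paper is made in wall-clock runtime rather than iteration count.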
Author Information
Lukas Tatzel (University of Tübingen)
Philipp Hennig (University of Tübingen)
Frank Schneider (University of Tübingen)
More from the Same Authors
- 2022 Workshop: Has it Trained Yet? A Workshop for Algorithmic Efficiency in Practical Neural Network Training
  Frank Schneider · Zachary Nado · Philipp Hennig · George Dahl · Naman Agarwal
- 2022 Poster: Posterior and Computational Uncertainty in Gaussian Processes
  Jonathan Wenger · Geoff Pleiss · Marvin Pförtner · Philipp Hennig · John Cunningham
- 2022 Poster: Posterior Refinement Improves Sample Efficiency in Bayesian Neural Networks
  Agustinus Kristiadi · Runa Eschenhagen · Philipp Hennig
- 2021 Poster: Cockpit: A Practical Debugging Tool for the Training of Deep Neural Networks
  Frank Schneider · Felix Dangel · Philipp Hennig