Learned optimizers are parametric algorithms that can themselves be trained to solve optimization problems. In contrast to baseline optimizers (such as momentum or Adam) that use simple update rules derived from theoretical principles, learned optimizers use flexible, high-dimensional, nonlinear parameterizations. Although this can lead to better performance, their inner workings remain a mystery. How is a given learned optimizer able to outperform a well-tuned baseline? Has it learned a sophisticated combination of existing optimization techniques, or is it implementing completely new behavior? In this work, we address these questions by careful analysis and visualization of learned optimizers. We study learned optimizers trained from scratch on four disparate tasks, and discover that they have learned interpretable behavior, including momentum, gradient clipping, learning rate schedules, and new forms of learning rate adaptation. Moreover, we show how dynamics and mechanisms inside learned optimizers orchestrate these computations. Our results help elucidate the previously murky understanding of how learned optimizers work, and establish tools for interpreting future learned optimizers.
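To make the contrast in the abstract concrete, below is a minimal sketch (not taken from the paper) of a hand-designed baseline versus a learned optimizer: the baseline applies a fixed momentum update rule, while the learned optimizer applies a small per-parameter neural network, whose weights would be meta-trained over a distribution of tasks, to features such as the gradient and momentum. The two-layer MLP architecture, the feature set, and all names here are illustrative assumptions, not the architecture used in the paper.

```python
import numpy as np

def sgd_momentum_update(param, grad, velocity, lr=0.1, beta=0.9):
    """Hand-designed baseline: a fixed, human-derived update rule."""
    velocity = beta * velocity + grad
    return param - lr * velocity, velocity

class LearnedOptimizer:
    """Illustrative learned optimizer: a tiny per-parameter MLP maps features of
    the gradient and momentum to an update. The weights `theta` would normally be
    meta-trained; the architecture here is a hypothetical stand-in."""

    def __init__(self, theta, beta=0.9):
        self.theta = theta  # dict of meta-learned MLP weights
        self.beta = beta

    def update(self, param, grad, velocity):
        velocity = self.beta * velocity + grad
        # Per-parameter input features: gradient, momentum, current parameter value.
        feats = np.stack([grad, velocity, param], axis=-1)            # (n, 3)
        hidden = np.tanh(feats @ self.theta["w1"] + self.theta["b1"])  # (n, 8)
        out = hidden @ self.theta["w2"] + self.theta["b2"]             # (n, 2)
        # The network outputs both an update direction and a per-parameter log
        # step size; behaviors like clipping or learning rate adaptation can
        # emerge from how the learned weights shape these outputs.
        direction, log_step = out[..., 0], out[..., 1]
        return param - np.exp(log_step) * direction, velocity

# Toy usage on a quadratic loss f(x) = 0.5 * ||x||^2, so grad = x.
# Random theta is only to demonstrate the interface; real weights are meta-learned.
rng = np.random.default_rng(0)
theta = {"w1": rng.normal(0, 0.1, (3, 8)), "b1": np.zeros(8),
         "w2": rng.normal(0, 0.1, (8, 2)), "b2": np.zeros(2)}
opt = LearnedOptimizer(theta)
x, v = np.ones(4), np.zeros(4)
for _ in range(5):
    x, v = opt.update(x, grad=x, velocity=v)
```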
Author Information
Niru Maheswaranathan (Meta Platforms, Inc.)
David Sussillo (Stanford University)
Luke Metz (Google Brain)
Ruoxi Sun (Google)
Jascha Sohl-Dickstein (Google)
More from the Same Authors
- 2020 : Training more effective learned optimizers »
  Luke Metz
- 2021 : Fast Finite Width Neural Tangent Kernel »
  Roman Novak · Jascha Sohl-Dickstein · Samuel Schoenholz
- 2022 : Meta-Learning General-Purpose Learning Algorithms with Transformers »
  Louis Kirsch · Luke Metz · James Harrison · Jascha Sohl-Dickstein
- 2023 Poster: Variance-Reduced Gradient Estimation via Noise-Reuse in Online Evolution Strategies »
  Oscar Li · James Harrison · Jascha Sohl-Dickstein · Virginia Smith · Luke Metz
- 2022 Poster: A Closer Look at Learned Optimization: Stability, Robustness, and Inductive Biases »
  James Harrison · Luke Metz · Jascha Sohl-Dickstein
- 2022 Poster: Discovered Policy Optimisation »
  Chris Lu · Jakub Kuba · Alistair Letcher · Luke Metz · Christian Schroeder de Witt · Jakob Foerster
- 2021 : Luke Metz Q&A »
  Luke Metz
- 2021 : Luke Metz »
  Luke Metz
- 2021 Poster: Towards understanding retrosynthesis by energy-based models »
  Ruoxi Sun · Hanjun Dai · Li Li · Steven Kearnes · Bo Dai
- 2021 Poster: Reverse engineering recurrent neural networks with Jacobian switching linear dynamical systems »
  Jimmy Smith · Scott Linderman · David Sussillo
- 2021 Poster: Understanding How Encoder-Decoder Architectures Attend »
  Kyle Aitken · Vinay Ramasesh · Yuan Cao · Niru Maheswaranathan
- 2020 : Reverse engineering learned optimizers reveals known and novel mechanisms »
  Niru Maheswaranathan · David Sussillo · Luke Metz · Ruoxi Sun · Jascha Sohl-Dickstein
- 2020 Poster: Ridge Rider: Finding Diverse Solutions by Following Eigenvectors of the Hessian »
  Jack Parker-Holder · Luke Metz · Cinjon Resnick · Hengyuan Hu · Adam Lerer · Alistair Letcher · Alexander Peysakhovich · Aldo Pacchiano · Jakob Foerster
- 2020 Poster: Organizing recurrent network dynamics by task-computation to enable continual learning »
  Lea Duncker · Laura N Driscoll · Krishna V Shenoy · Maneesh Sahani · David Sussillo
- 2019 Poster: Universality and individuality in neural dynamics across large populations of recurrent networks »
  Niru Maheswaranathan · Alex Williams · Matthew Golub · Surya Ganguli · David Sussillo
- 2019 Spotlight: Universality and individuality in neural dynamics across large populations of recurrent networks »
  Niru Maheswaranathan · Alex Williams · Matthew Golub · Surya Ganguli · David Sussillo
- 2019 Poster: Scalable Bayesian inference of dendritic voltage via spatiotemporal recurrent state space models »
  Ruoxi Sun · Ian Kinsella · Scott Linderman · Liam Paninski
- 2019 Poster: From deep learning to mechanistic understanding in neuroscience: the structure of retinal prediction »
  Hidenori Tanaka · Aran Nayebi · Niru Maheswaranathan · Lane McIntosh · Stephen Baccus · Surya Ganguli
- 2019 Poster: Learning to Predict Without Looking Ahead: World Models Without Forward Prediction »
  Daniel Freeman · David Ha · Luke Metz
- 2019 Oral: Scalable Bayesian inference of dendritic voltage via spatiotemporal recurrent state space models »
  Ruoxi Sun · Ian Kinsella · Scott Linderman · Liam Paninski
- 2019 Poster: Reverse engineering recurrent networks for sentiment classification reveals line attractor dynamics »
  Niru Maheswaranathan · Alex Williams · Matthew Golub · Surya Ganguli · David Sussillo
- 2018 : Learned optimizers that outperform SGD on wall-clock and validation loss »
  Luke Metz
- 2015 Workshop: Statistical Methods for Understanding Neural Systems »
  Alyson Fletcher · Jakob H Macke · Ryan Adams · Jascha Sohl-Dickstein
- 2012 Poster: Training sparse natural image models with a fast Gibbs sampler of an extended state space »
  Lucas Theis · Jascha Sohl-Dickstein · Matthias Bethge