Natural gradient descent, which preconditions a gradient descent update with the Fisher information matrix of the underlying statistical model, is a way to capture partial second-order information. Several highly visible works have advocated an approximation known as the empirical Fisher, drawing connections between approximate second-order methods and heuristics like Adam. We dispute this argument by showing that the empirical Fisher, unlike the Fisher, does not generally capture second-order information. We further argue that the conditions under which the empirical Fisher approaches the Fisher (and the Hessian) are unlikely to be met in practice, and that, even on simple optimization problems, the pathologies of the empirical Fisher can have undesirable effects.
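For readers less familiar with the terminology, the following is a brief sketch in our own notation (not taken verbatim from the paper) of the quantities being contrasted. For a probabilistic model $p_\theta(y \mid x)$ fit to data $\{(x_n, y_n)\}_{n=1}^{N}$ with loss $\mathcal{L}(\theta) = -\sum_n \log p_\theta(y_n \mid x_n)$, natural gradient descent preconditions the update with the Fisher information matrix $F(\theta)$, while the empirical Fisher $\widetilde{F}(\theta)$ replaces the model's expectation over labels with the observed labels:

\[
\theta_{t+1} = \theta_t - \gamma\, F(\theta_t)^{-1} \nabla \mathcal{L}(\theta_t),
\]
\[
F(\theta) = \sum_{n=1}^{N} \mathbb{E}_{y \sim p_\theta(y \mid x_n)}\!\left[ \nabla_\theta \log p_\theta(y \mid x_n)\, \nabla_\theta \log p_\theta(y \mid x_n)^{\top} \right],
\qquad
\widetilde{F}(\theta) = \sum_{n=1}^{N} \nabla_\theta \log p_\theta(y_n \mid x_n)\, \nabla_\theta \log p_\theta(y_n \mid x_n)^{\top}.
\]

The expectation over $y \sim p_\theta(y \mid x_n)$ is what ties the Fisher to curvature (for common likelihoods it coincides with a generalized Gauss-Newton approximation of the Hessian of $\mathcal{L}$); substituting the observed $y_n$ breaks that connection, which is the gap between the empirical Fisher and second-order information that the abstract refers to.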
Author Information
Frederik Kunstner (EPFL)
Philipp Hennig (University of Tübingen and MPI for Intelligent Systems Tübingen)
Lukas Balles (University of Tübingen)
More from the Same Authors
- 2021 : Heavy-tailed noise does not explain the gap between SGD and Adam on Transformers
  Jacques Chen · Frederik Kunstner · Mark Schmidt
- 2020 Poster: Probabilistic Linear Solvers for Machine Learning
  Jonathan Wenger · Philipp Hennig
- 2019 Poster: Convergence Guarantees for Adaptive Bayesian Quadrature Methods
  Motonobu Kanagawa · Philipp Hennig
- 2018 Poster: SLANG: Fast Structured Covariance Approximations for Bayesian Deep Learning with Natural Gradient
  Aaron Mishkin · Frederik Kunstner · Didrik Nielsen · Mark Schmidt · Mohammad Emtiyaz Khan