Workshop: OPT2020: Optimization for Machine Learning

Invited speaker: Practical Kronecker-factored BFGS and L-BFGS methods for training deep neural networks, Donald Goldfarb

In training deep neural network (DNN) models, computing and storing a full BFGS approximation, or even storing a modest number of (step, change in gradient) vector pairs for use in an L-BFGS implementation, is impractical. In our methods, we approximate the Hessian by a block-diagonal matrix in which each block corresponds to a layer, and we use the structure of the gradient and Hessian to further approximate each block as the Kronecker product of two much smaller matrices. This is analogous to the approach KFAC uses to approximate the Fisher matrix in a stochastic natural gradient method. Because the Hessian of a DNN is indefinite and highly variable, we also propose a new damping approach that keeps the BFGS and L-BFGS approximations bounded, both above and below. In tests on feed-forward autoencoder and convolutional neural network models, our methods outperformed KFAC and were competitive with state-of-the-art first-order stochastic methods.
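
To illustrate why a Kronecker-factored block is cheap to work with, here is a minimal NumPy sketch. It is not the speaker's implementation. It assumes a single layer whose curvature block is approximated as the Kronecker product of two small symmetric positive definite matrices, and it applies the inverse of that block to a weight gradient without ever forming the large matrix. The second function shows the classical Powell damping of a BFGS update, swapped in here only as a familiar stand-in: the new damping scheme mentioned in the abstract is not specified there and is not reproduced. The names `kron_precondition`, `powell_damped_bfgs_update`, `a_factor`, `g_factor`, and the threshold `mu` are hypothetical and chosen for illustration.

```python
import numpy as np


def kron_precondition(grad_w, a_factor, g_factor):
    """Return X with vec(X) = (g_factor kron a_factor)^{-1} vec(grad_w).

    vec is row-major (NumPy's default ravel order). grad_w has shape
    (d_out, d_in), g_factor is (d_out, d_out), a_factor is (d_in, d_in).
    The (d_out*d_in) x (d_out*d_in) Kronecker product is never formed.
    """
    # Identity used: (G kron A)^{-1} vec(W) = vec(G^{-1} W A^{-T}).
    y = np.linalg.solve(g_factor, grad_w)    # G^{-1} W
    return np.linalg.solve(a_factor, y.T).T  # (G^{-1} W) A^{-T}


def powell_damped_bfgs_update(b, s, y, mu=0.2):
    """One BFGS update of a symmetric positive definite matrix b.

    Uses classical Powell damping (an assumption of this sketch, not the
    damping proposed in the talk): when the curvature s^T y is too small
    or negative, y is blended with b @ s so the update stays positive
    definite.
    """
    bs = b @ s
    sbs = s @ bs
    sy = s @ y
    if sy < mu * sbs:
        theta = (1.0 - mu) * sbs / (sbs - sy)
        y = theta * y + (1.0 - theta) * bs   # now s^T y == mu * sbs > 0
    return b - np.outer(bs, bs) / sbs + np.outer(y, y) / (s @ y)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d_out, d_in = 3, 4
    m_g = rng.standard_normal((d_out, d_out))
    m_a = rng.standard_normal((d_in, d_in))
    g_factor = m_g @ m_g.T + np.eye(d_out)   # small SPD factor
    a_factor = m_a @ m_a.T + np.eye(d_in)    # small SPD factor
    grad_w = rng.standard_normal((d_out, d_in))

    fast = kron_precondition(grad_w, a_factor, g_factor)
    # Reference: form the full Kronecker product and solve directly.
    full = np.linalg.solve(np.kron(g_factor, a_factor), grad_w.ravel())
    print("max abs difference:", np.abs(fast.ravel() - full).max())
```

For a layer with d_in inputs and d_out outputs, the factored solve works with one d_out x d_out and one d_in x d_in matrix, whereas the full block has (d_out * d_in)^2 entries, which is the storage problem the abstract describes.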