NeurIPS Poster Phase diagram of early training dynamics in deep neural networks: effect of the learning rate, depth, and width

Poster

Phase diagram of early training dynamics in deep neural networks: effect of the learning rate, depth, and width

Dayal Singh Kalra · Maissam Barkeshli

Great Hall & Hall B1+B2 (level 1) #814

[ Abstract ] [ Project Page ]

[ Paper] [ Slides] [ Poster] [ OpenReview]

Abstract: We systematically analyze optimization dynamics in deep neural networks (DNNs) trained with stochastic gradient descent (SGD) and study the effect of learning rate

η

$\eta$ , depth

d

$d$ , and width

w

$w$ of the neural network. By analyzing the maximum eigenvalue

λ_{t}^{H}

$\lambda^H_t$ of the Hessian of the loss, which is a measure of sharpness of the loss landscape, we find that the dynamics can show four distinct regimes: (i) an early time transient regime, (ii) an intermediate saturation regime, (iii) a progressive sharpening regime, and (iv) a late time "edge of stability" regime. The early and intermediate regimes (i) and (ii) exhibit a rich phase diagram depending on

η \equiv c / λ_{0}^{H}

$\eta \equiv c / \lambda_0^H$ ,

d

$d$ , and

w

$w$ . We identify several critical values of

c

$c$ , which separate qualitatively distinct phenomena in the early time dynamics of training loss and sharpness. Notably, we discover the opening up of a "sharpness reduction" phase, where sharpness decreases at early times, as

d

$d$ and

1 / w

$1/w$ are increased.

Chat is not available.