Poster
Global Convergence of Gradient Descent for Deep Linear Residual Networks
Lei Wu · Qingcan Wang · Chao Ma
East Exhibition Hall B, C #201
Keywords: Optimization for Deep Networks; Deep Learning; Non-Convex Optimization; Computational Complexity; Learning Theory
Abstract:
We analyze the global convergence of gradient descent for deep linear residual
networks by proposing a new initialization, the zero-asymmetric (ZAS)
initialization, which is designed to avoid the stable manifolds of saddle
points. We prove that under the ZAS initialization, for an arbitrary target
matrix, gradient descent converges to an ε-optimal point in O(L^3 log(1/ε))
iterations, which scales polynomially with the network depth L. Together with
the exp(Ω(L)) convergence time under standard initializations (Xavier or
near-identity) (Shamir, 2018), our result demonstrates the importance of both
the residual structure and the initialization for optimizing deep linear
neural networks, especially when L is large.
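
The abstract does not spell out the ZAS initialization in full, so the following is a minimal NumPy sketch under one plausible reading: the residual weights W_1, ..., W_L start at zero (each block I + W_l is then the identity) while the output layer also starts at zero, so the network initially maps every input to zero. The model form f(x) = W_out (I + W_L) ... (I + W_1) x, the squared Frobenius loss against a target matrix Phi, and the step size and dimensions below are illustrative assumptions, not the paper's exact setting or the step size prescribed by the theorem.

import numpy as np

def zas_init(d, L):
    # Hypothetical reading of zero-asymmetric (ZAS) initialization:
    # residual weights start at zero (each block I + W_l is the identity)
    # and the output layer starts at zero, so the network maps every
    # input to zero at initialization.
    Ws = [np.zeros((d, d)) for _ in range(L)]  # W_1, ..., W_L
    W_out = np.zeros((d, d))                   # output layer
    return Ws, W_out

def end_to_end(Ws, W_out):
    # End-to-end linear map M = W_out (I + W_L) ... (I + W_1).
    d = W_out.shape[0]
    P = np.eye(d)
    for W in Ws:
        P = (np.eye(d) + W) @ P
    return W_out @ P

def gd_step(Ws, W_out, Phi, lr):
    # One full-batch gradient descent step on the loss
    # 0.5 * || W_out (I + W_L) ... (I + W_1) - Phi ||_F^2.
    d, L = W_out.shape[0], len(Ws)
    # Prefix products A[l] = (I + W_l) ... (I + W_1), with A[0] = I.
    A = [np.eye(d)]
    for W in Ws:
        A.append((np.eye(d) + W) @ A[-1])
    # Suffix products B[l] = W_out (I + W_L) ... (I + W_{l+1}), with B[L] = W_out.
    B = [None] * (L + 1)
    B[L] = W_out
    for l in range(L - 1, -1, -1):
        B[l] = B[l + 1] @ (np.eye(d) + Ws[l])
    R = W_out @ A[L] - Phi                      # residual of the end-to-end map
    grad_out = R @ A[L].T                       # gradient w.r.t. W_out
    grads = [B[l + 1].T @ R @ A[l].T for l in range(L)]  # gradients w.r.t. each W_l
    Ws = [W - lr * g for W, g in zip(Ws, grads)]
    return Ws, W_out - lr * grad_out, 0.5 * np.sum(R * R)

if __name__ == "__main__":
    np.random.seed(0)
    d, L = 10, 20
    lr = 1e-3                       # illustrative step size, not the one from the theorem
    Phi = np.random.randn(d, d)     # arbitrary target matrix
    Ws, W_out = zas_init(d, L)
    for t in range(3001):
        Ws, W_out, loss = gd_step(Ws, W_out, Phi, lr)
        if t % 500 == 0:
            print(f"iter {t:4d}  loss {loss:.3e}")

In this sketch, the residual-block gradients vanish at initialization (each carries a factor of the zero output layer), so the first updates move only the output layer; this is one way to read the asymmetry between the output layer and the residual blocks that the name suggests.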