

Poster

SGD vs GD: Rank Deficiency in Linear Networks

Aditya Vardhan Varre · Margarita Sagitova · Nicolas Flammarion


Abstract:

In this article, we study the behaviour of continuous-time gradient methods on a two-layer linear network with square loss. A dichotomy between SGD and GD is revealed: GD preserves the rank at initialization, while (label noise) SGD diminishes the rank regardless of the initialization. We demonstrate this rank deficiency by studying the time evolution of the determinant of a matrix of parameters. To further understand this phenomenon, we derive the stochastic differential equation (SDE) governing the eigenvalues of the parameter matrix. This SDE unveils a repulsive force between the eigenvalues: a key regularization mechanism that induces rank deficiency. Our results are well supported by experiments illustrating the phenomenon beyond linear networks and regression tasks.
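The abstract contrasts plain gradient descent with label-noise SGD on a two-layer linear network. The sketch below is a minimal illustrative simulation, not the authors' exact setup: it trains W2 W1 on a synthetic rank-1 regression task, optionally adding fresh Gaussian noise to the targets at every step as a simple surrogate for label-noise SGD, and inspects the singular values of a layer matrix to probe rank. All dimensions, step sizes, and the choice of matrix being monitored are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Dimensions chosen so the hidden width exceeds the teacher's rank:
# minimizers of the square loss then admit layers of reduced rank.
d, k, m, n = 6, 6, 6, 200
X = rng.standard_normal((d, n))
T = rng.standard_normal((m, 1)) @ rng.standard_normal((1, d))  # rank-1 teacher
Y = T @ X

def grads(W1, W2, X, Y):
    """Gradients of (1/2n) * ||W2 W1 X - Y||_F^2 with respect to W1 and W2."""
    R = (W2 @ W1 @ X - Y) / X.shape[1]
    return W2.T @ R @ X.T, R @ X.T @ W1.T

def train(label_noise_std, steps=50_000, lr=5e-3):
    """Gradient descent; if label_noise_std > 0, fresh Gaussian noise is
    added to the targets at every step (a simple label-noise SGD surrogate)."""
    W1 = 0.3 * rng.standard_normal((k, d))   # generic full-rank initialization
    W2 = 0.3 * rng.standard_normal((m, k))
    for _ in range(steps):
        noise = label_noise_std * rng.standard_normal(Y.shape)
        g1, g2 = grads(W1, W2, X, Y + noise)
        W1 -= lr * g1
        W2 -= lr * g2
    return W1, W2

for std in (0.0, 0.5):
    W1, W2 = train(std)
    s1 = np.linalg.svd(W1, compute_uv=False)
    print(f"label-noise std={std}: singular values of W1 =", np.round(s1, 3))

# Under the abstract's dichotomy, one would expect the std=0.0 run (plain GD)
# to keep all singular values bounded away from zero (rank of the
# initialization preserved), while the label-noise run drives the trailing
# singular values towards zero (rank deficiency).
```

This is only a discrete-time caricature of the continuous-time analysis described in the abstract; the determinant and eigenvalue SDE arguments concern the limiting dynamics, which the finite-step simulation can at best approximate.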
