Workshop: OPT 2021: Optimization for Machine Learning

Spherical Perspective on Learning with Normalization Layers

Simon Roburin · Yann de Mont-Marin · Andrei Bursuc · Renaud Marlet · Patrick Pérez · Mathieu Aubry

Abstract: Normalization Layers (NL) are widely used in modern deep-learning architectures. Despite their apparent simplicity, their effect on optimization is not yet fully understood. We introduce a spherical framework to study the optimization of neural networks with NL from a geometric perspective. Concretely, we leverage the radial invariance of groups of parameters to translate the optimization steps onto the $L_2$ unit hypersphere. This formulation and the associated geometric interpretation shed new light on the training dynamics. We use it to derive the first expression of the effective learning rate of Adam. We then show theoretically and empirically that, in the presence of NL, performing SGD alone is actually equivalent to a variant of Adam constrained to the unit hypersphere.
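
The radial invariance mentioned in the abstract can be illustrated with a small numerical sketch. It is not taken from the paper: the toy loss, the data point `(x, y)`, and all helper names are assumptions chosen for illustration, and it covers only plain SGD, not the Adam expression the paper derives. The loss below sees the weights only through their direction $w / \|w\|$, which mimics a parameter group followed by a normalization layer. Under that assumption, the loss is unchanged by rescaling, the gradient is orthogonal to $w$ and shrinks as $1/\lambda$ under rescaling by $\lambda$, and an SGD step with learning rate $\eta$ moves the direction exactly like a step of size $\eta / \|w\|^2$ taken from the unit hypersphere and projected back.

```python
import math

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def norm(w):
    return math.sqrt(dot(w, w))

def normalize(w):
    r = norm(w)
    return [v / r for v in w]

def loss(w, x, y):
    # Hypothetical scale-invariant loss: only the direction w/||w|| is used.
    pred = dot(normalize(w), x)
    return 0.5 * (pred - y) ** 2

def grad(w, x, y):
    # Analytic gradient of the loss above with respect to w.
    r = norm(w)
    u = normalize(w)
    pred = dot(u, x)
    return [(pred - y) * (xi - pred * ui) / r for xi, ui in zip(x, u)]

def sgd_step(w, g, lr):
    return [wi - lr * gi for wi, gi in zip(w, g)]

# Illustrative values (assumptions, not from the paper).
x, y = [1.0, 2.0, -1.0], 0.5
w = [3.0, -4.0, 12.0]
lam, lr = 2.5, 0.1

r = norm(w)
u = normalize(w)
g = grad(w, x, y)
scaled_w = [lam * wi for wi in w]

# 1. Radial invariance: rescaling w does not change the loss.
loss_gap = abs(loss(scaled_w, x, y) - loss(w, x, y))

# 2. The gradient has no radial component: <g, w> = 0.
radial_component = dot(g, w)

# 3. The gradient at lam * w is the gradient at w divided by lam.
grad_scaling_gap = max(
    abs(lam * gs - gi) for gs, gi in zip(grad(scaled_w, x, y), g)
)

# 4. Because g is orthogonal to w, an SGD step increases the norm:
#    ||w - lr * g||^2 = ||w||^2 + lr^2 * ||g||^2.
w_next = sgd_step(w, g, lr)
norm_gap = abs(dot(w_next, w_next) - (r ** 2 + lr ** 2 * dot(g, g)))

# 5. Direction after the SGD step on w equals the direction after a step
#    of size lr / ||w||^2 taken from the unit vector u.
u_next = normalize(sgd_step(u, grad(u, x, y), lr / r ** 2))
direction_gap = max(abs(a - b) for a, b in zip(normalize(w_next), u_next))

print("loss gap under rescaling:", loss_gap)
print("radial component of gradient:", radial_component)
print("gradient scaling gap:", grad_scaling_gap)
print("norm identity gap:", norm_gap)
print("direction gap (effective step lr / r^2):", direction_gap)
```

All five printed quantities should be zero up to floating-point error. The last check is the sense in which $\eta / \|w\|^2$ acts as an effective learning rate for SGD on the hypersphere in this toy setting.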