

Poster
in
Workshop: OPT 2023: Optimization for Machine Learning

How Over-Parameterization Slows Down Gradient Descent in Matrix Sensing: The Curses of Symmetry and Initialization

Nuoya Xiong · Lijun Ding · Simon Du


Abstract: This paper rigorously shows how over-parameterization dramatically changes the convergence behavior of gradient descent (GD) for the matrix sensing problem, where the goal is to recover an unknown low-rank ground-truth matrix from near-isotropic linear measurements. First, we consider the symmetric setting with the symmetric parameterization, where $M \in \mathbb{R}^{n \times n}$ is a positive semi-definite unknown matrix of rank $r \ll n$, and one uses a symmetric parameterization $XX^\top$ to learn $M$. Here $X \in \mathbb{R}^{n \times k}$ with $k > r$ is the factor matrix. We give a novel $\Omega(1/T^2)$ lower bound for randomly initialized GD in the over-parameterized case ($k > r$), where $T$ is the number of iterations. This is in stark contrast to the exact-parameterization scenario ($k = r$), where the convergence rate is $\exp(-\Omega(T))$. Next, we study the asymmetric setting, where $M \in \mathbb{R}^{n_1 \times n_2}$ is the unknown matrix of rank $r \le \min\{n_1, n_2\}$, and one uses an asymmetric parameterization $FG^\top$ to learn $M$, where $F \in \mathbb{R}^{n_1 \times k}$ and $G \in \mathbb{R}^{n_2 \times k}$. We give the first global exact convergence result of randomly initialized GD for the exact-parameterization case ($k = r$) with an $\exp(-\Omega(T))$ rate. Furthermore, we give the first global exact convergence result for the over-parameterization case ($k > r$) with an $\exp(-\Omega(\alpha^2 T))$ rate, where $\alpha$ is the initialization scale. This linear convergence result in the over-parameterization case is especially significant because one can apply the asymmetric parameterization to the symmetric setting to speed up from $\Omega(1/T^2)$ to linear convergence. Therefore, we identify a surprising phenomenon: asymmetric parameterization can exponentially speed up convergence. Equally surprising, our analysis highlights the importance of the imbalance between $F$ and $G$, in sharp contrast to prior works, which emphasize balance. We further give an example showing that the dependency on $\alpha$ in the convergence rate is unavoidable in the worst case.
On the other hand, we propose a novel method that only modifies one step of GD and obtains a convergence rate independent of α, recovering the rate in the exact-parameterization case. We provide empirical studies to verify our theoretical findings.
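
As a toy illustration of the two parameterizations discussed above (not the paper's experimental setup), the sketch below runs plain GD on a rank-1 PSD target under full observation, which stands in as a simplification of near-isotropic sensing. The dimensions `n`, `r`, `k`, step size `eta`, iteration count `T`, and initialization scale `alpha` are all arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, r, k = 10, 1, 3        # ambient dimension, true rank, over-parameterized width (k > r)
T, eta, alpha = 5000, 0.01, 0.1

# Rank-r PSD ground truth M = U* U*^T (full observation stands in for the
# near-isotropic linear measurements of the actual sensing problem).
U_star = rng.standard_normal((n, r))
M = U_star @ U_star.T

# Symmetric parameterization: GD on f(X) = 1/4 ||X X^T - M||_F^2.
X = alpha * rng.standard_normal((n, k))
for _ in range(T):
    X = X - eta * (X @ X.T - M) @ X     # grad f(X) = (X X^T - M) X, since M is symmetric
err_sym = np.linalg.norm(X @ X.T - M)

# Asymmetric parameterization: GD on g(F, G) = 1/2 ||F G^T - M||_F^2
# with small (and generically imbalanced) random initialization.
F = alpha * rng.standard_normal((n, k))
G = alpha * rng.standard_normal((n, k))
for _ in range(T):
    R = F @ G.T - M
    F, G = F - eta * R @ G, G - eta * R.T @ F   # simultaneous GD step on both factors
err_asym = np.linalg.norm(F @ G.T - M)

print(f"symmetric  X X^T error: {err_sym:.2e}")
print(f"asymmetric F G^T error: {err_asym:.2e}")
```

Both runs start from a small random initialization so that the product is near zero; the printed Frobenius errors show how far each parameterization has driven its reconstruction toward $M$ after the same iteration budget.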
