

Poster
in
Workshop: OPT 2023: Optimization for Machine Learning

How Over-Parameterization Slows Down Gradient Descent in Matrix Sensing: The Curses of Symmetry and Initialization

Nuoya Xiong · Lijun Ding · Simon Du


Abstract: This paper rigorously shows how over-parameterization dramatically changes the convergence behavior of gradient descent (GD) for the matrix sensing problem, where the goal is to recover an unknown low-rank ground-truth matrix from near-isotropic linear measurements. First, we consider the symmetric setting with the symmetric parameterization, where $M \in \mathbb{R}^{n \times n}$ is a positive semi-definite unknown matrix of rank $r \ll n$, and one uses a symmetric parameterization $XX^\top$ to learn $M$. Here $X \in \mathbb{R}^{n \times k}$ with $k > r$ is the factor matrix. We give a novel $\Omega(1/T^2)$ lower bound for randomly initialized GD in the over-parameterized case ($k > r$), where $T$ is the number of iterations. This is in stark contrast to the exact-parameterization scenario ($k = r$), where the convergence rate is $\exp(-\Omega(T))$. Next, we study the asymmetric setting, where $M \in \mathbb{R}^{n_1 \times n_2}$ is the unknown matrix of rank $r \le \min\{n_1, n_2\}$, and one uses an asymmetric parameterization $FG^\top$ to learn $M$, where $F \in \mathbb{R}^{n_1 \times k}$ and $G \in \mathbb{R}^{n_2 \times k}$. We give the first global exact convergence result of randomly initialized GD for the exact-parameterization case ($k = r$) with an $\exp(-\Omega(T))$ rate. Furthermore, we give the first global exact convergence result for the over-parameterization case ($k > r$) with an $\exp(-\Omega(\alpha^2 T))$ rate, where $\alpha$ is the initialization scale. This linear convergence result in the over-parameterization case is especially significant because one can apply the asymmetric parameterization to the symmetric setting to speed up from $\Omega(1/T^2)$ to linear convergence. Therefore, we identify a surprising phenomenon: asymmetric parameterization can exponentially speed up convergence. Equally surprising, our analysis highlights the importance of the imbalance between $F$ and $G$, in sharp contrast to prior works, which emphasize balance. We further give an example showing that the dependency on $\alpha$ in the convergence rate is unavoidable in the worst case.
On the other hand, we propose a novel method that only modifies one step of GD and obtains a convergence rate independent of α, recovering the rate in the exact-parameterization case. We provide empirical studies to verify our theoretical findings.
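
As a toy illustration of the two parameterizations discussed above (not the paper's experimental setup), the sketch below runs plain GD on a rank-1 PSD target under full observation, which stands in as a simplification of near-isotropic sensing. The dimensions `n`, `r`, `k`, step size `eta`, iteration count `T`, and initialization scale `alpha` are all arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, r, k = 10, 1, 3        # ambient dimension, true rank, over-parameterized width (k > r)
T, eta, alpha = 5000, 0.01, 0.1

# Rank-r PSD ground truth M = U* U*^T (full observation stands in for the
# near-isotropic linear measurements of the actual sensing problem).
U_star = rng.standard_normal((n, r))
M = U_star @ U_star.T

# Symmetric parameterization: GD on f(X) = 1/4 ||X X^T - M||_F^2.
X = alpha * rng.standard_normal((n, k))
for _ in range(T):
    X = X - eta * (X @ X.T - M) @ X     # grad f(X) = (X X^T - M) X, since M is symmetric
err_sym = np.linalg.norm(X @ X.T - M)

# Asymmetric parameterization: GD on g(F, G) = 1/2 ||F G^T - M||_F^2
# with small (and generically imbalanced) random initialization.
F = alpha * rng.standard_normal((n, k))
G = alpha * rng.standard_normal((n, k))
for _ in range(T):
    R = F @ G.T - M
    F, G = F - eta * R @ G, G - eta * R.T @ F   # simultaneous GD step on both factors
err_asym = np.linalg.norm(F @ G.T - M)

print(f"symmetric  X X^T error: {err_sym:.2e}")
print(f"asymmetric F G^T error: {err_asym:.2e}")
```

Both runs start from a small random initialization so that the product is near zero; the printed Frobenius errors show how far each parameterization has driven its reconstruction toward $M$ after the same iteration budget.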
