Poster in Workshop: OPT 2023: Optimization for Machine Learning
How Over-Parameterization Slows Down Gradient Descent in Matrix Sensing: The Curses of Symmetry and Initialization
Nuoya Xiong · Lijun Ding · Simon Du
Abstract:
This paper rigorously shows how over-parameterization dramatically changes the convergence behaviors of gradient descent (GD) for the matrix sensing problem, where the goal is to recover an unknown low-rank ground-truth matrix from near-isotropic linear measurements. First, we consider the symmetric setting with the symmetric parameterization, where $M^* \in \mathbb{R}^{n \times n}$ is a positive semi-definite unknown matrix of rank $r$, and one uses a symmetric parameterization $XX^\top$ to learn $M^*$. Here $X \in \mathbb{R}^{n \times k}$ with $k > r$ is the factor matrix. We give a novel $\Omega(1/T^2)$ lower bound of randomly initialized GD for the over-parameterized case ($k > r$), where $T$ is the number of iterations. This is in stark contrast to the exact-parameterization scenario ($k = r$), where the convergence rate is $\exp(-\Omega(T))$. Next, we study the asymmetric setting, where $M^* \in \mathbb{R}^{n_1 \times n_2}$ is the unknown matrix of rank $r$, and one uses an asymmetric parameterization $FG^\top$ to learn $M^*$, where $F \in \mathbb{R}^{n_1 \times k}$ and $G \in \mathbb{R}^{n_2 \times k}$. We give the first global exact convergence result of randomly initialized GD for the exact-parameterization case ($k = r$) with an $\exp(-\Omega(T))$ rate. Furthermore, we give the first global exact convergence result for the over-parameterization case ($k > r$) with an $\exp(-\Omega(\alpha^2 T))$ rate, where $\alpha$ is the initialization scale. This linear convergence result in the over-parameterization case is especially significant because one can apply the asymmetric parameterization to the symmetric setting to speed up from $\Omega(1/T^2)$ to linear convergence. Therefore, we identify a surprising phenomenon: asymmetric parameterization can exponentially speed up convergence. Equally surprising is our analysis, which highlights the importance of imbalance between $F$ and $G$. This is in sharp contrast to prior works, which emphasize balance. We further give an example showing the dependency on $\alpha$ in the convergence rate is unavoidable in the worst case. On the other hand, we propose a novel method that only modifies one step of GD and obtains a convergence rate independent of $\alpha$, recovering the rate in the exact-parameterization case. We provide empirical studies to verify our theoretical findings.
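Since the abstract centers on a concrete algorithmic comparison (randomly initialized GD on the symmetric $XX^\top$ versus the asymmetric $FG^\top$ factorization for matrix sensing), a small simulation makes the setup tangible. Below is a minimal NumPy sketch, not the authors' code: the problem sizes, number of measurements, step size, and initialization scale are illustrative assumptions, as are the names `grad_L` and `M_star`. It runs GD on both parameterizations of the same over-parameterized ($k > r$) instance; per the paper's theory, one would expect the symmetric run to decay only polynomially while the asymmetric run converges linearly at a rate governed by the initialization scale $\alpha$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumptions, not from the paper): n x n PSD ground
# truth of rank r, factor width k > r (over-parameterized), m measurements.
n, r, k, m = 10, 2, 4, 1000

# Rank-r PSD ground truth M* = U U^T, scaled to O(1) spectral norm.
U = rng.normal(size=(n, r)) / np.sqrt(n)
M_star = U @ U.T

# Near-isotropic Gaussian linear measurements y_i = <A_i, M*>.
A = rng.normal(size=(m, n, n))
y = np.einsum("mij,ij->m", A, M_star)

def grad_L(M):
    """Gradient of L(M) = (1/2m) * sum_i (<A_i, M> - y_i)^2 with respect to M."""
    resid = np.einsum("mij,ij->m", A, M) - y
    return np.einsum("m,mij->ij", resid, A) / m

eta, T, alpha = 0.05, 5000, 0.1  # step size, iterations, init scale (assumed)

X = alpha * rng.normal(size=(n, k))  # symmetric factor:    M = X X^T
F = alpha * rng.normal(size=(n, k))  # asymmetric factors:  M = F G^T
G = alpha * rng.normal(size=(n, k))

for t in range(T):
    # Chain rule for the symmetric loss L(X X^T): grad_X = (gL + gL^T) X.
    gL = grad_L(X @ X.T)
    X = X - eta * (gL + gL.T) @ X
    # Chain rule for the asymmetric loss L(F G^T): grad_F = gL G, grad_G = gL^T F.
    gL = grad_L(F @ G.T)
    F, G = F - eta * gL @ G, G - eta * gL.T @ F

print("symmetric  ||XX^T - M*||_F:", np.linalg.norm(X @ X.T - M_star))
print("asymmetric ||FG^T - M*||_F:", np.linalg.norm(F @ G.T - M_star))
```

Logging the two Frobenius errors over iterations, rather than only at the end, should display the predicted gap: roughly $1/T^2$ decay for the symmetric factorization versus geometric decay for the asymmetric one, and shrinking `alpha` should visibly slow the asymmetric run, consistent with the $\exp(-\Omega(\alpha^2 T))$ rate.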