Oral
Smoothing the Landscape Boosts the Signal for SGD: Optimal Sample Complexity for Learning Single Index Models
Alex Damian · Eshaan Nichani · Rong Ge · Jason Lee
Hall C2 (level 1 gate 9 south of food court)
Abstract:
We focus on the task of learning a single index model σ(w⋆ · x) with respect to the isotropic Gaussian distribution in d dimensions. Prior work has shown that the sample complexity of learning w⋆ is governed by the information exponent k⋆ of the link function σ, defined as the index of the first nonzero Hermite coefficient of σ. Ben Arous et al. (2021) showed that n ≳ d^{k⋆−1} samples suffice for learning w⋆ and that this is tight for online SGD. However, the CSQ lower bound for gradient-based methods only shows that n ≳ d^{k⋆/2} samples are necessary. In this work, we close the gap between the upper and lower bounds by showing that online SGD on a smoothed loss learns w⋆ with n ≳ d^{k⋆/2} samples. We also draw connections to statistical analyses of tensor PCA and to the implicit regularization effects of minibatch SGD on empirical losses.
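As a rough illustration of the setup, the sketch below runs online spherical SGD on a smoothed squared loss for a single index model with link σ(z) = He₃(z) = z³ − 3z (information exponent k⋆ = 3). The spherical-perturbation smoothing operator, smoothing radius, step size, and iteration count are illustrative assumptions, not the exact choices analyzed in the paper.

```python
import numpy as np

# Sketch: online SGD on a smoothed loss for a single index model
# y = sigma(w_star . x), x ~ N(0, I_d). All hyperparameters below are
# illustrative assumptions; the paper's smoothing operator and scalings may differ.

rng = np.random.default_rng(0)

d = 100                                  # ambient dimension
sigma = lambda z: z**3 - 3 * z           # Hermite polynomial He_3: information exponent k_star = 3
dsigma = lambda z: 3 * z**2 - 3          # its derivative
lam = 1.0                                # smoothing radius (assumed value)
eta = 1e-3                               # step size (assumed value)
n_steps = 20000                          # number of online samples

w_star = np.zeros(d)
w_star[0] = 1.0                          # unknown unit direction to recover
w = rng.standard_normal(d)
w /= np.linalg.norm(w)                   # random initialization on the sphere

def smoothed_grad(w, x, y, lam, n_mc=8):
    """Monte Carlo estimate of the gradient of a smoothed loss
    L_lam(w) = E_u[ 0.5 * (sigma((w + lam*u) . x) - y)^2 ],
    with u uniform on the unit sphere (an assumed form of the smoothing)."""
    g = np.zeros_like(w)
    for _ in range(n_mc):
        u = rng.standard_normal(len(w))
        u /= np.linalg.norm(u)
        v = w + lam * u                  # perturbed direction
        resid = sigma(v @ x) - y
        g += resid * dsigma(v @ x) * x   # gradient of the squared loss at v w.r.t. w
    return g / n_mc

for t in range(n_steps):
    x = rng.standard_normal(d)           # fresh sample at every step (online SGD)
    y = sigma(w_star @ x)
    w -= eta * smoothed_grad(w, x, y, lam)
    w /= np.linalg.norm(w)               # project back onto the unit sphere

print("alignment |<w, w_star>| =", abs(w @ w_star))
```

The alignment printed at the end measures recovery of w⋆; with the illustrative constants above it is only meant to show the mechanics of smoothing and spherical projection, not to reproduce the d^{k⋆/2} sample complexity guarantee.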