Workshop: OPT 2023: Optimization for Machine Learning

Adagrad Promotes Diffuse Solutions In Overparameterized Regimes

Andrew Rambidis · Jiayi Wang


With the widespread use of over-parameterized models in deep learning, the choice of optimizer plays a significant role in a model's generalization ability due to its implicit solution selection bias. This work focuses on the adaptive gradient optimizer Adagrad in the over-parameterized least-squares regime. We empirically find that, when using sufficiently small step sizes, Adagrad promotes diffuse solutions, in the sense of uniformity among the coordinates of the solution. Additionally, we theoretically show that under the same conditions Adagrad's solution exhibits greater diffusion than the solution obtained through gradient descent (GD), by analyzing the ratio of their updates. Lastly, we empirically compare the performance of Adagrad and GD on generated datasets. We observe a consistent trend that Adagrad promotes more diffuse solutions, which aligns with our theoretical analysis.
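The setting above can be reproduced in a few lines. The following is a minimal NumPy sketch (not the authors' code): an over-parameterized least-squares problem with more coordinates than observations, solved from a zero initialization by plain GD and by Adagrad with a small step size. The `diffusion` measure here, the ratio of the l1 norm to the l-infinity norm, is one illustrative choice of uniformity measure and is an assumption of this sketch, not necessarily the metric used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Over-parameterized least squares: d coordinates, only n < d observations,
# so A x = b has infinitely many solutions and the optimizer selects one.
n, d = 20, 100
A = rng.standard_normal((n, d))
b = rng.standard_normal(n)

def grad(x):
    # Gradient of 0.5 * ||A x - b||^2
    return A.T @ (A @ x - b)

def run_gd(steps=10_000, lr=1e-3):
    x = np.zeros(d)
    for _ in range(steps):
        x -= lr * grad(x)
    return x

def run_adagrad(steps=10_000, lr=1e-2, eps=1e-8):
    x = np.zeros(d)
    G = np.zeros(d)  # running per-coordinate sum of squared gradients
    for _ in range(steps):
        g = grad(x)
        G += g * g
        # Per-coordinate step: large accumulated gradient -> smaller step,
        # which tends to equalize movement across coordinates.
        x -= lr * g / (np.sqrt(G) + eps)
    return x

def diffusion(x):
    # l1 / l-inf ratio: ranges from 1 (all mass on one coordinate)
    # to d (perfectly uniform magnitudes). Higher = more diffuse.
    return np.linalg.norm(x, 1) / np.linalg.norm(x, np.inf)

x_gd = run_gd()
x_ada = run_adagrad()
print("GD diffusion:     ", diffusion(x_gd))
print("Adagrad diffusion:", diffusion(x_ada))
```

Both runs converge to near-zero training loss; comparing the two diffusion values for several random seeds and step sizes is the kind of experiment the abstract describes on generated datasets.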