Sharpness-Aware Minimization (SAM) is a highly effective regularization technique for improving the generalization of deep neural networks in various settings. However, the underlying workings of SAM remain elusive because of various intriguing approximations in its theoretical characterization. SAM intends to penalize one notion of sharpness of the model but implements a computationally efficient variant; moreover, a third notion of sharpness is used for proving generalization guarantees. The subtle differences between these notions of sharpness can lead to significantly different empirical results. This paper rigorously nails down the exact sharpness notion that SAM regularizes and clarifies the underlying mechanism. We also show that the two steps of approximation in the original motivation of SAM individually lead to inaccurate local conclusions, but their combination accidentally reveals the correct effect when full-batch gradients are applied. Furthermore, we prove that the stochastic version of SAM in fact regularizes yet another notion of sharpness, which is most likely the preferred notion for practical performance. The key mechanism behind this intriguing phenomenon is the implicit alignment between the gradient and the top eigenvector of the Hessian when running SAM.
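For context, the computationally efficient variant of SAM referred to in the abstract is a two-step update: an ascent step that perturbs the weights along the normalized gradient, followed by a descent step using the gradient evaluated at the perturbed point. The snippet below is a minimal full-batch sketch of that standard update, not code from the paper; the helper names (`sam_step`, `loss_grad`) and the toy quadratic loss are hypothetical and for illustration only.

```python
import numpy as np

def sam_step(w, loss_grad, lr=0.1, rho=0.05):
    """One full-batch SAM update (illustrative sketch):
    ascend along the normalized gradient, then descend
    using the gradient at the perturbed point."""
    g = loss_grad(w)                             # gradient at current weights
    eps = rho * g / (np.linalg.norm(g) + 1e-12)  # ascent perturbation of radius rho
    g_adv = loss_grad(w + eps)                   # gradient at perturbed weights
    return w - lr * g_adv                        # descent step

# Toy usage on a quadratic loss L(w) = 0.5 * w^T A w (illustrative only).
A = np.diag([10.0, 1.0])                         # ill-conditioned Hessian
grad = lambda w: A @ w
w = np.array([1.0, 1.0])
for _ in range(100):
    w = sam_step(w, grad, lr=0.05, rho=0.1)
```

With stochastic (mini-batch) gradients the same two steps are applied per batch, which is the setting in which the paper argues a different notion of sharpness is regularized.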
Author Information
Kaiyue Wen (Tsinghua University)
Tengyu Ma (Stanford University)
Zhiyuan Li (Stanford University)
Related Events (a corresponding poster, oral, or spotlight)
- 2022: How Sharpness-Aware Minimization Minimizes Sharpness?
More from the Same Authors
- 2022: First Steps Toward Understanding the Extrapolation of Nonlinear Models to Unseen Domains
  Kefan Dong · Tengyu Ma
- 2022 Poster: Statistically Meaningful Approximation: a Case Study on Approximating Turing Machines with Transformers
  Colin Wei · Yining Chen · Tengyu Ma
- 2022 Poster: Implicit Bias of Gradient Descent on Reparametrized Models: On Equivalence to Mirror Descent
  Zhiyuan Li · Tianhao Wang · Jason Lee · Sanjeev Arora
- 2022 Poster: Iterative Feature Matching: Toward Provable Domain Generalization with Logarithmic Environments
  Yining Chen · Elan Rosenfeld · Mark Sellke · Tengyu Ma · Andrej Risteski
- 2022 Poster: Beyond Separability: Analyzing the Linear Transferability of Contrastive Representations to Related Subpopulations
  Jeff Z. HaoChen · Colin Wei · Ananya Kumar · Tengyu Ma
- 2022 Poster: Understanding the Generalization Benefit of Normalization Layers: Sharpness Reduction
  Kaifeng Lyu · Zhiyuan Li · Sanjeev Arora
- 2022 Poster: Fast Mixing of Stochastic Gradient Descent with Normalization and Weight Decay
  Zhiyuan Li · Tianhao Wang · Dingli Yu
- 2021 Poster: On the Validity of Modeling SGD with Stochastic Differential Equations (SDEs)
  Zhiyuan Li · Sadhika Malladi · Sanjeev Arora
- 2021 Poster: Gradient Descent on Two-layer Nets: Margin Maximization and Simplicity Bias
  Kaifeng Lyu · Zhiyuan Li · Runzhe Wang · Sanjeev Arora