
On the Convergence of Encoder-only Shallow Transformers
Yongtao Wu · Fanghui Liu · Grigorios Chrysos · Volkan Cevher

Tue Dec 12 03:15 PM -- 05:15 PM (PST) @ Great Hall & Hall B1+B2 #1621

In this paper, we aim to build a global convergence theory for encoder-only shallow Transformers under a realistic setting in terms of architecture, initialization, and scaling in the finite-width regime. The difficulty lies in how to handle the softmax in the self-attention mechanism, the core ingredient of the Transformer. In particular, we diagnose the scaling scheme, carefully handle the input/output of the softmax, and prove that quadratic overparameterization is sufficient for global convergence of our shallow Transformers under the He/LeCun initializations commonly used in practice. In addition, we provide a neural tangent kernel (NTK) based analysis, which facilitates a comprehensive comparison. Our theory demonstrates a separation in the importance of different scaling schemes and initializations. We believe our results can pave the way for a better understanding of modern Transformers, particularly of their training dynamics.
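The abstract's key ingredients can be illustrated concretely. Below is a minimal NumPy sketch of a single-head, encoder-only shallow Transformer with row-wise softmax self-attention, a tunable attention scaling factor `tau`, and He/LeCun-style Gaussian initialization. This is an illustrative toy, not the paper's exact parameterization: the function names (`shallow_encoder`, `init`), the ReLU output head, and the choice of `tau` are assumptions for demonstration.

```python
import numpy as np

def softmax(z, axis=-1):
    """Numerically stable row-wise softmax."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def init(fan_in, fan_out, scheme, rng):
    """Gaussian init: He uses Var = 2/fan_in, LeCun uses Var = 1/fan_in."""
    std = np.sqrt((2.0 if scheme == "he" else 1.0) / fan_in)
    return rng.normal(0.0, std, size=(fan_in, fan_out))

def shallow_encoder(X, params, tau):
    """Toy shallow Transformer: one softmax self-attention layer
    plus a two-layer ReLU head; architecture details are assumed.

    X: (n_tokens, d) token embeddings; tau: attention scaling factor.
    """
    WQ, WK, WV, W1, w2 = params
    scores = (X @ WQ) @ (X @ WK).T * tau   # scaled attention logits
    A = softmax(scores, axis=-1)           # each row sums to 1
    H = A @ (X @ WV)                       # attended values
    Z = np.maximum(H @ W1, 0.0)            # ReLU hidden layer of width m
    return float((Z @ w2).mean())          # scalar network output
```

For example, with embedding dimension `d = 8` and hidden width `m = 32`, one would draw `WQ, WK, WV, W1` with He initialization, `w2` with LeCun initialization, and set `tau = 1/np.sqrt(d)` (the standard scaled dot-product choice); the paper's analysis concerns how such scaling and initialization choices affect global convergence of training.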

Author Information

Yongtao Wu (EPFL)
Fanghui Liu (University of Warwick, UK)
Grigorios Chrysos (EPFL)
Volkan Cevher (EPFL)
