

Poster

In-Context Learning of a Linear Transformer Block: Benefits of the MLP Component and One-Step GD Initialization

Ruiqi Zhang · Jingfeng Wu · Peter Bartlett

East Exhibit Hall A-C #4804
Wed 11 Dec 11 a.m. PST — 2 p.m. PST

Abstract: We study the \emph{in-context learning} (ICL) ability of a \emph{Linear Transformer Block} (LTB) that combines a linear attention component and a linear multi-layer perceptron (MLP) component. For ICL of linear regression with a Gaussian prior and a \emph{non-zero mean}, we show that LTB can achieve nearly Bayes optimal ICL risk. In contrast, any estimator that uses only linear attention must incur an irreducible additive approximation error. Furthermore, we establish a correspondence between LTB and one-step gradient descent estimators with learnable initialization (GDβ), in the sense that every GDβ estimator can be implemented by an LTB estimator, and every optimal LTB estimator that minimizes the in-class ICL risk is effectively a GDβ estimator. Finally, we show that GDβ estimators can be efficiently optimized with gradient flow, despite a non-convex training objective. Our results reveal that LTB achieves ICL by implementing GDβ, and they highlight the role of MLP layers in reducing approximation error.
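To make the GDβ correspondence concrete, here is a minimal sketch (not the authors' code) of a one-step gradient descent estimator with learnable initialization for in-context linear regression. The names `b_init` (learnable initialization) and `eta` (learnable step size or preconditioner) are hypothetical; the abstract only states that estimators of this general form can be implemented by an LTB.

```python
import numpy as np

def gd_beta_predict(X, y, x_query, b_init, eta):
    """Predict the query label from in-context examples (X, y) via one
    gradient descent step started from a learnable initialization.

    X:       (n, d) in-context inputs
    y:       (n,)   in-context targets
    x_query: (d,)   query input
    b_init:  (d,)   learnable initialization of the weight vector
    eta:     scalar step size or a (d, d) learnable preconditioner
    """
    n = X.shape[0]
    # Gradient of the in-context least-squares loss
    #   L(b) = (1 / 2n) * ||X b - y||^2   evaluated at b_init.
    grad = X.T @ (X @ b_init - y) / n
    # One gradient step from the learnable initialization.
    if np.isscalar(eta):
        b_one_step = b_init - eta * grad
    else:
        b_one_step = b_init - eta @ grad
    # Linear prediction at the query point.
    return x_query @ b_one_step
```

Under this reading, the learnable initialization plays the role the abstract attributes to the MLP component: when the prior mean of the regression weights is non-zero, starting the single gradient step from a learned non-zero point removes the additive approximation error that a pure linear-attention (zero-initialization) estimator cannot avoid.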
