Poster
Implicit Optimization Bias of Next-token Prediction in Linear Models
Christos Thrampoulidis
West Ballroom A-D #5703
Next-token prediction (NTP) has become the go-to training paradigm for modern language models, yet its optimization principles are not well understood. To bridge this gap, we initiate a study of the structural properties of the solutions selected by gradient-based optimizers among the many possible minimizers of the NTP objective. By framing NTP as cross-entropy minimization across distinct contexts, each tied to a sparse conditional probability distribution over a finite vocabulary of tokens, we introduce "NTP-separability conditions" that enable reaching the data-entropy lower bound. With this setup, our focus then shifts to linear models, for which we characterize the optimization bias of gradient descent (GD): within a data subspace defined by the sparsity patterns of the distinct contexts, GD selects parameters that equate the differences of logits between in-support tokens to their log-odds. In the orthogonal subspace, the GD parameters diverge in norm and select the direction that maximizes a margin specific to NTP. These findings extend previous research on implicit bias in one-hot classification to the NTP setting, highlighting key differences and prompting further research into the optimization and generalization properties of NTP.
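As a quick numerical illustration (a sketch, not code from the paper), the snippet below trains a linear model with gradient descent on a toy NTP objective and checks the first claim: for each context, differences of logits between in-support tokens should approach the empirical log-odds of those tokens. The vocabulary size, embedding dimension, contexts, supports, probabilities, and learning rate are all arbitrary choices made here; the random context embeddings are assumed (being generically linearly independent) to satisfy the NTP-separability conditions.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, m = 5, 8, 3                # vocabulary size, embedding dim, number of distinct contexts
H = rng.standard_normal((m, d))  # context embeddings; random vectors are generically linearly
                                 # independent, so NTP-separability is assumed to hold

# sparse conditional next-token distributions: support set and probabilities per context
supports = [[0, 1], [1, 2, 3], [0, 4]]
probs = [np.array([0.7, 0.3]), np.array([0.5, 0.3, 0.2]), np.array([0.6, 0.4])]

def grad(W):
    """Gradient of the NTP cross-entropy objective, averaged over contexts."""
    G = np.zeros_like(W)
    for h, S, p in zip(H, supports, probs):
        s = W @ h
        q = np.exp(s - s.max()); q /= q.sum()  # softmax over the vocabulary
        target = np.zeros(V); target[S] = p    # sparse target distribution for this context
        G += np.outer(q - target, h) / m       # d/dW of -sum_z p_z * log softmax_z(W h)
    return G

W = np.zeros((V, d))
for _ in range(300_000):     # long training: ||W|| slowly diverges along the margin direction,
    W -= 0.2 * grad(W)       # while in-support logit gaps settle on the data subspace

# in-support logit differences vs. empirical log-odds (should be close)
for h, S, p in zip(H, supports, probs):
    s = W @ h
    print([round(float(s[S[0]] - s[z]), 3) for z in S[1:]],
          [round(float(np.log(p[0] / pz)), 3) for pz in p[1:]])
```

Consistent with the second claim, the norm of W keeps growing throughout training, yet the in-support logit gaps stabilize: the divergent margin-maximizing component lives in the orthogonal subspace and does not disturb the log-odds relation.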