Skip to yearly menu bar Skip to main content

Workshop: Workshop on robustness of zero/few-shot learning in foundation models (R0-FoMo)

In-Context Learning and Bayesian Inference

Madhur Panwar · Kabir Ahuja · Navin Goyal

Abstract: In-context learning (ICL) is one of the surprising and useful features of large language models and subject of intense research. Recently, stylized meta-learning-like ICL setups have been devised that train transformers on sequences of input-output pairs $(x, f(x))$ using the language modeling loss. The function $f$ comes from a function class and generalization is checked by evaluation on sequences for unseen functions from the same class. One of the main discoveries in this line of research has been that for several function classes, such as linear regression, transformers successfully generalize to new functions in the class. However, it is unclear if transformers trained on multiple function classes (a setup closer to that of real-world LLMs) also exhibit this generalization. Moreover, the inductive biases of these models resulting in this generalization are not clearly understood. A model with unlimited training data and compute is a Bayesian predictor: it learns the pretraining distribution. In this paper we address these issues and empirically examine how far this Bayesian perspective can help us understand ICL. To this end, we generalize the previous meta-ICL setup to hierarchical meta-ICL setup which involve unions of multiple task families. We instantiate this setup on a diverse range of linear and nonlinear function families and find that transformers can do ICL in this setting as well. Where Bayesian inference is tractable, we find evidence that high-capacity transformers mimic the Bayesian predictor. Via the example of learning Fourier series, we also study the inductive bias for in-context learning. We find that in-context learning may or may not have simplicity bias depending on the pretraining data distribution. The Bayesian perspective provides insights into these inductive biases and how transformers perform a particular task when they are trained on multiple tasks.

Chat is not available.