

Poster in Workshop: Foundation Models for Decision Making

In-Context Multi-Armed Bandits via Supervised Pretraining

Fred Zhang · Jiaxin Ye · Zhuoran Yang

Presentation: Foundation Models for Decision Making
Fri 15 Dec, 6:15 a.m.–3:30 p.m. PST

Abstract:

This work explores the in-context learning capabilities of large transformer models for decision-making in reinforcement learning (RL) environments, focusing on multi-armed bandit problems. We introduce the Reward-Weighted Decision-Pretrained Transformer (DPT-RW), a model trained by straightforward supervised pretraining under a reward-weighted imitation-learning loss. Given a query state and an in-context dataset drawn from varied tasks, DPT-RW predicts the optimal action. Surprisingly, this simple approach yields a model that can solve a wide range of RL problems in-context, exhibiting both online exploration and offline conservatism without being trained explicitly for either. Most notably, the model achieves optimal performance in the online setting despite being pretrained only on data generated by suboptimal policies, with no access to optimal data.
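As a loose illustration of the pretraining objective described in the abstract, the sketch below assumes the reward-weighted imitation loss simply scales each target action's negative log-likelihood by its observed reward. The architecture, the uniform behavior policy, and names such as `BanditTransformer` and `reward_weighted_loss` are hypothetical stand-ins rather than the authors' implementation, and the query state is dropped because a bandit has only a single state.

```python
# Minimal sketch: reward-weighted supervised pretraining of an in-context
# bandit transformer. All names and design choices here are illustrative
# assumptions, not the paper's actual implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class BanditTransformer(nn.Module):
    """Maps an in-context dataset of (arm, reward) pairs to arm logits."""
    def __init__(self, num_arms: int, d_model: int = 64, n_layers: int = 2):
        super().__init__()
        # Each context token is a one-hot arm concatenated with its reward.
        self.embed = nn.Linear(num_arms + 1, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Linear(d_model, num_arms)

    def forward(self, arms_onehot, rewards):
        # arms_onehot: (B, T, num_arms); rewards: (B, T, 1)
        tokens = self.embed(torch.cat([arms_onehot, rewards], dim=-1))
        h = self.encoder(tokens)            # (B, T, d_model)
        return self.head(h.mean(dim=1))     # pooled context -> arm logits

def reward_weighted_loss(logits, actions, rewards):
    # Imitation loss in which each target action's negative log-likelihood
    # is scaled by the reward it earned, so high-reward actions dominate.
    log_probs = torch.log_softmax(logits, dim=-1)
    nll = -log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)
    return (rewards * nll).mean()

# One pretraining step on synthetic Bernoulli bandit tasks (illustrative).
num_arms, T, B = 5, 20, 32
model = BanditTransformer(num_arms)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

means = torch.rand(B, num_arms)                   # per-task arm means
arms = torch.randint(num_arms, (B, T + 1))        # suboptimal (uniform) policy
rewards = torch.bernoulli(means.gather(1, arms))  # observed rewards
ctx = F.one_hot(arms[:, :T], num_arms).float()    # first T pairs form context
logits = model(ctx, rewards[:, :T].unsqueeze(-1))
loss = reward_weighted_loss(logits, arms[:, T], rewards[:, T])

opt.zero_grad()
loss.backward()
opt.step()
```

Under this weighting, actions that happened to earn high reward in the context are imitated more strongly, which is one way a model could learn to favor promising arms without ever seeing optimal-action labels.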
