Poster in Workshop: Workshop on Distribution Shifts: New Frontiers with Foundation Models

Reward Model Underspecification in Language Model Alignment

Jacob Eisenstein ⋅ Jonathan Berant ⋅ Chirag Nagpal ⋅ Alekh Agarwal ⋅ Ahmad Beirami ⋅ Alexander D'Amour ⋅ Krishnamurthy Dvijotham ⋅ Katherine Heller ⋅ Stephen Pfohl ⋅ Deepak Ramachandran

Keywords: alignment, underspecification, reward models, ensembles

Abstract

Reward models play a key role in aligning language model applications towards human preferences. However, this setup can create a dynamic in which the policy model has the incentive to exploit errors in the reward model to achieve high reward. This means that the success of reward-based alignment depends on the ability of reward models to transfer to new distributions created by the aligned policy model. We show that reward models are underspecified, in the sense that models that perform similarly in-distribution can yield very different rewards on policy model outputs. These differences propagate to the aligned policies, which we show to be heavily influenced by the random seed used during pretraining of the reward model. We show that even a simple alignment strategy, best-of-n reranking, creates a semi-adversarial dynamic between the policy and reward models, promoting outputs on which the reward models are more likely to disagree. Finally, we show that a simple ensembling strategy can help to address this issue.
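
The best-of-n reranking and ensembling ideas mentioned in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the abstract does not say how ensemble members are combined, so the mean and min aggregators here are placeholders, and the two toy reward functions (toy_rm_a, toy_rm_b) are hypothetical stand-ins for trained reward models, not the authors' models or method.

```python
from statistics import mean
from typing import Callable, Sequence

# A reward model is treated as any function (prompt, response) -> score.
RewardFn = Callable[[str, str], float]


def ensemble_reward(prompt: str, response: str,
                    reward_models: Sequence[RewardFn],
                    aggregate: Callable[[list], float] = mean) -> float:
    """Combine the scores of several reward models into one number.

    The aggregator is an assumption: mean by default; min gives a more
    conservative score.
    """
    return aggregate([rm(prompt, response) for rm in reward_models])


def best_of_n(prompt: str, candidates: Sequence[str],
              reward_models: Sequence[RewardFn],
              aggregate: Callable[[list], float] = mean) -> str:
    """Return the candidate with the highest (aggregated) reward."""
    if not candidates:
        raise ValueError("need at least one candidate")
    return max(candidates,
               key=lambda c: ensemble_reward(prompt, c, reward_models, aggregate))


def reward_disagreement(prompt: str, response: str,
                        reward_models: Sequence[RewardFn]) -> float:
    """Spread (max minus min) of the reward models' scores on one response."""
    scores = [rm(prompt, response) for rm in reward_models]
    return max(scores) - min(scores)


# Hypothetical toy reward models that agree on short answers but diverge
# on long ones, mimicking disagreement away from the training distribution.
def toy_rm_a(prompt: str, response: str) -> float:
    return float(len(response.split()))              # rewards length


def toy_rm_b(prompt: str, response: str) -> float:
    return -2.0 * abs(len(response.split()) - 3)     # prefers about 3 words


prompt = "Is the sky blue?"
candidates = ["yes", "yes it is", "yes it is indeed very very long"]

single_choice = best_of_n(prompt, candidates, [toy_rm_a])
ensemble_choice = best_of_n(prompt, candidates, [toy_rm_a, toy_rm_b])
conservative_choice = best_of_n(prompt, candidates, [toy_rm_a, toy_rm_b], aggregate=min)
```

In this toy setup, reranking with a single reward model selects the response that exploits that model's quirk (length), which is also the response on which the two models disagree most. Reranking with the ensemble selects a response both models rate reasonably well.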
