Reward Model Underspecification in Language Model Alignment

Jacob Eisenstein · Jonathan Berant · Chirag Nagpal · Alekh Agarwal · Ahmad Beirami · Alexander D'Amour · Krishnamurthy Dvijotham · Katherine Heller · Stephen Pfohl · Deepak Ramachandran

Abstract
