Reward Model Underspecification in Language Model Alignment

Jacob Eisenstein ⋅ Jonathan Berant ⋅ Chirag Nagpal ⋅ Alekh Agarwal ⋅ Ahmad Beirami ⋅ Alexander D'Amour ⋅ Krishnamurthy Dvijotham ⋅ Katherine Heller ⋅ Stephen Pfohl ⋅ Deepak Ramachandran

Abstract
