Poster in Workshop: Safe Generative AI

Anchored Optimization and Contrastive Revisions: Addressing Reward Hacking in Alignment

Karel D'Oosterlinck ⋅ Winnie Xu ⋅ Chris Develder ⋅ Thomas Demeester ⋅ Amanpreet Singh ⋅ Christopher Potts ⋅ Douwe Kiela ⋅ Shikib Mehri

Abstract
