Search All 2024 Events

3 Results

Workshop
Anchored Preference Optimization and Contrastive Revisions: Addressing Underspecification in Alignment
Karel D'Oosterlinck · Winnie Xu · Chris Develder · Thomas Demeester · Amanpreet Singh · Christopher Potts · Douwe Kiela · Shikib Mehri
Poster
Thu 16:30
InfoRM: Mitigating Reward Hacking in RLHF via Information-Theoretic Reward Modeling
Yuchun Miao · Sen Zhang · Liang Ding · Rong Bao · Lefei Zhang · Dacheng Tao
Workshop
Honesty to Subterfuge: In-Context Reinforcement Learning Can Make Honest Models Reward Hack
Leo McKee-Reid · Christoph Sträter · Maria Martinez · Joe Needham · Mikita Balesni