4 Results
Workshop
Anchored Optimization and Contrastive Revisions: Addressing Reward Hacking in Alignment
Karel D'Oosterlinck · Winnie Xu · Chris Develder · Thomas Demeester · Amanpreet Singh · Christopher Potts · Douwe Kiela · Shikib Mehri

Poster | Thu 16:30
InfoRM: Mitigating Reward Hacking in RLHF via Information-Theoretic Reward Modeling
Yuchun Miao · Sen Zhang · Liang Ding · Rong Bao · Lefei Zhang · Dacheng Tao

Workshop
Honesty to Subterfuge: In-Context Reinforcement Learning Can Make Honest Models Reward Hack
Leo McKee-Reid · Christoph Sträter · Maria Martinez · Joe Needham · Mikita Balesni

Workshop
Honesty to Subterfuge: In-Context Reinforcement Learning Can Make Honest Models Reward Hack
Leo McKee-Reid · Joe Needham · Maria Martinez · Christoph Sträter · Mikita Balesni