Honesty to Subterfuge: In-Context Reinforcement Learning Can Make Honest Models Reward Hack

Leo McKee-Reid · Joe Needham · Maria Martinez · Christoph Sträter · Mikita Balesni

Abstract
