Honesty to Subterfuge: In-Context Reinforcement Learning Can Make Honest Models Reward Hack

Leo McKee-Reid ⋅ Joe Needham ⋅ Maria Martinez ⋅ Christoph Sträter ⋅ Mikita Balesni

Abstract
