firstbacksecondback
1 Results
Workshop
|
Honesty to Subterfuge: In-Context Reinforcement Learning Can Make Honest Models Reward Hack Leo McKee-Reid · Joe Needham · Maria Martinez · Christoph Sträter · Mikita Balesni |