Timezone: »
On The Fragility of Learned Reward Functions
Lev McKinney · Yawen Duan · Adam Gleave · David Krueger
Event URL: https://openreview.net/forum?id=9gj9vXfeS-y »
Reward functions are notoriously difficult to specify, especially for tasks with complex goals. Reward learning approaches attempt to infer reward functions from human feedback and preferences. Prior works on reward learning mainly focus on achieving high final performance for agents trained alongside the reward function. However, many of these works fail to investigate whether the resulting learned reward model accurately captures the intended behavior. In this work, we focus on the $\textit{relearning}$ failures of learned reward models. We demonstrate that when they are reused to train randomly initialized policies by designing experiments on both tabular and continuous control environments. We found that the severity of relearning failure might be sensitive to changes in reward model design and the trajectory dataset. Finally, we discussed the potential limitations of our methods and emphases the need for more retraining-based evaluations in the literature.
Reward functions are notoriously difficult to specify, especially for tasks with complex goals. Reward learning approaches attempt to infer reward functions from human feedback and preferences. Prior works on reward learning mainly focus on achieving high final performance for agents trained alongside the reward function. However, many of these works fail to investigate whether the resulting learned reward model accurately captures the intended behavior. In this work, we focus on the $\textit{relearning}$ failures of learned reward models. We demonstrate that when they are reused to train randomly initialized policies by designing experiments on both tabular and continuous control environments. We found that the severity of relearning failure might be sensitive to changes in reward model design and the trajectory dataset. Finally, we discussed the potential limitations of our methods and emphases the need for more retraining-based evaluations in the literature.
Author Information
Lev McKinney (University of Toronto)
Yawen Duan (University of Cambridge)
Adam Gleave (UC Berkeley)
David Krueger (University of Cambridge)
More from the Same Authors
-
2021 : BLAST: Latent Dynamics Models from Bootstrapping »
Keiran Paster · Lev McKinney · Sheila McIlraith · Jimmy Ba -
2021 : Multi-Domain Balanced Sampling Improves Out-of-Distribution Generalization of Chest X-ray Pathology Prediction Models »
Enoch Tetteh · David Krueger · Joseph Paul Cohen · Yoshua Bengio -
2021 : Preprocessing Reward Functions for Interpretability »
Erik Jenner · Adam Gleave -
2022 : Domain Generalization for Robust Model-Based Offline Reinforcement Learning »
Alan Clark · Shoaib Siddiqui · Robert Kirk · Usman Anwar · Stephen Chung · David Krueger -
2022 : Mechanistic Lens on Mode Connectivity »
Ekdeep S Lubana · Eric Bigelow · Robert Dick · David Krueger · Hidenori Tanaka -
2022 : Domain Generalization for Robust Model-Based Offline RL »
Alan Clark · Shoaib Siddiqui · Robert Kirk · Usman Anwar · Stephen Chung · David Krueger -
2022 : Training Equilibria in Reinforcement Learning »
Lauro Langosco · David Krueger · Adam Gleave -
2022 : Adversarial Policies Beat Professional-Level Go AIs »
Tony Wang · Adam Gleave · Nora Belrose · Tom Tseng · Michael Dennis · Yawen Duan · Viktor Pogrebniak · Joseph Miller · Sergey Levine · Stuart J Russell -
2022 : Assistance with large language models »
Dmitrii Krasheninnikov · Egor Krasheninnikov · David Krueger -
2022 : Adversarial Policies Beat Professional-Level Go AIs »
Tony Wang · Adam Gleave · Nora Belrose · Tom Tseng · Michael Dennis · Yawen Duan · Viktor Pogrebniak · Joseph Miller · Sergey Levine · Stuart Russell -
2022 : On The Fragility of Learned Reward Functions »
Lev McKinney · Yawen Duan · David Krueger · Adam Gleave -
2022 : Assistance with large language models »
Dmitrii Krasheninnikov · Egor Krasheninnikov · David Krueger -
2022 : Unifying Grokking and Double Descent »
Xander Davies · Lauro Langosco · David Krueger -
2023 Poster: Thinker: Learning to Plan and Act »
Stephen Chung · Ivan Anokhin · David Krueger -
2023 Workshop: Socially Responsible Language Modelling Research (SoLaR) »
Usman Anwar · David Krueger · Samuel Bowman · Jakob Foerster · Su Lin Blodgett · Roberta Raileanu · Alan Chan · Katherine Lee · Laura Ruis · Robert Kirk · Yawen Duan · Xin Chen · Kawin Ethayarajh -
2022 Poster: Defining and Characterizing Reward Gaming »
Joar Skalse · Nikolaus Howe · Dmitrii Krasheninnikov · David Krueger