Offline reinforcement learning, by learning from a fixed dataset, makes it possible to learn agent behaviors without interacting with the environment. However, depending on the quality of the offline dataset, such pre-trained agents may have limited performance and may need to be further fine-tuned online by interacting with the environment. During online fine-tuning, the performance of the pre-trained agent may collapse quickly due to the sudden distribution shift from offline to online data. While constraints enforced by offline RL methods, such as a behavior cloning loss, prevent this to some extent, they also significantly slow down online fine-tuning by forcing the agent to stay close to the behavior policy. We propose to adaptively weight the behavior cloning loss during online fine-tuning based on the agent's performance and training stability. Moreover, we use a randomized ensemble of Q functions to further increase the sample efficiency of online fine-tuning by performing a large number of learning updates. Experiments show that the proposed method yields state-of-the-art offline-to-online reinforcement learning performance on the popular D4RL benchmark.
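The abstract describes the two components of the method only at a high level. The sketch below illustrates, under stated assumptions, one plausible way the two pieces could look in PyTorch: a TD3+BC-style actor loss whose behavior cloning term is scaled by an adaptive weight, and a REDQ-style randomized ensemble target. The function and parameter names (actor_loss, update_bc_weight, ensemble_target_q, decay, grow, num_sampled) and the specific weight-update rule are illustrative assumptions, not the authors' exact formulation.

```python
import random
import torch


def actor_loss(policy, critic, batch, bc_weight):
    """TD3+BC-style actor objective: maximize Q while staying close to the
    dataset actions, with the behavior cloning term scaled by bc_weight.
    `policy`, `critic`, and `batch` are assumed interfaces for illustration."""
    pi = policy(batch["obs"])
    q = critic(batch["obs"], pi)
    lmbda = 1.0 / (q.abs().mean().detach() + 1e-8)  # normalize the Q scale
    bc_loss = ((pi - batch["actions"]) ** 2).mean()
    return -lmbda * q.mean() + bc_weight * bc_loss


def update_bc_weight(bc_weight, recent_return, best_return,
                     decay=0.99, grow=1.01, w_min=0.0, w_max=1.0):
    """Hypothetical adaptive rule: relax the BC constraint while online
    performance keeps improving, tighten it again if performance drops."""
    if recent_return >= best_return:
        bc_weight *= decay   # stable/improving -> loosen the constraint
    else:
        bc_weight *= grow    # performance dropped -> pull back toward the behavior policy
    return min(max(bc_weight, w_min), w_max)


def ensemble_target_q(q_ensemble, next_obs, next_actions, num_sampled=2):
    """Randomized-ensemble target: take the min over a small random subset of
    the Q ensemble, which keeps the target conservative while allowing many
    learning updates per environment step. (Subset size 2 is an assumed default.)"""
    sampled = random.sample(q_ensemble, num_sampled)
    q_values = torch.stack([q(next_obs, next_actions) for q in sampled], dim=0)
    return q_values.min(dim=0).values
```

A typical usage pattern would call update_bc_weight once per evaluation interval using recent episode returns, then feed the resulting weight into actor_loss at every gradient step.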
Author Information
Yi Zhao (Aalto University)
Rinu Boney (Aalto University)
Alexander Ilin (Aalto University)
Juho Kannala (Aalto University)
Joni Pajarinen (Aalto University)