Guiding Offline Reinforcement Learning Using a Safety Expert
Richa Verma · Kartik Bharadwaj · Harshad Khadilkar · Balaraman Ravindran
Event URL: https://openreview.net/forum?id=6LlUmoSpzR

Offline reinforcement learning is used to train policies in situations where it is expensive or infeasible to access the environment during training. An agent trained in this setting receives no corrective feedback once the learned policy starts to diverge, and it may fall prey to the overestimation bias commonly seen in offline RL. This increases the chance that the agent chooses unsafe or risky actions, especially in states with sparse or no representation in the training dataset. In this paper, we propose to leverage a safety expert to discourage the offline RL agent from choosing unsafe actions in states that are under-represented in the dataset. The proposed framework transfers the safety expert's knowledge in an offline setting for states with high uncertainty, to prevent catastrophic failures in safety-critical domains. We use a simple but effective approach to quantify the uncertainty of a state based on how frequently it appears in the training dataset. In states with high uncertainty, the offline RL agent mimics the safety expert while maximizing the long-term reward. As part of the proposed approach, we modify TD3+BC, an existing offline RL algorithm. We demonstrate empirically that our approach performs better than TD3+BC on some control tasks and comparably on others across two sets of benchmark datasets, while reducing the chance of taking unsafe actions in sparse regions of the state space.
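The abstract does not spell out the uncertainty measure or the modified objective, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes (hypothetically) that uncertainty is the inverse square root of how often a state's grid cell appears in the dataset, and that the safety expert enters as an extra imitation term, weighted by that uncertainty, added to the standard TD3+BC actor objective. The names `count_based_uncertainty`, `guided_actor_loss`, `bin_width`, and `expert_actions` are illustrative.

```python
import numpy as np


def count_based_uncertainty(states, bin_width=0.5):
    """Assumed uncertainty measure: 1 / sqrt(count) of each state's grid cell.

    States are discretized into cells of side `bin_width`; a state whose cell
    is rare in the dataset gets an uncertainty close to 1.
    """
    cells = np.floor(np.asarray(states, dtype=float) / bin_width).astype(np.int64)
    _, inverse, counts = np.unique(
        cells, axis=0, return_inverse=True, return_counts=True
    )
    inverse = inverse.reshape(-1)  # guard against NumPy version differences
    return 1.0 / np.sqrt(counts[inverse])


def guided_actor_loss(q_values, policy_actions, data_actions,
                      expert_actions, uncertainty, alpha=2.5):
    """Assumed per-batch actor loss (to be minimized).

    The first two terms follow TD3+BC: a normalized Q term plus behavior
    cloning toward the dataset action. The third term is the hypothetical
    addition: imitation of the safety expert, scaled by state uncertainty.
    """
    q_values = np.asarray(q_values, dtype=float)
    lam = alpha / (np.abs(q_values).mean() + 1e-8)  # TD3+BC Q normalization
    bc_term = ((policy_actions - data_actions) ** 2).mean(axis=1)
    expert_term = ((policy_actions - expert_actions) ** 2).mean(axis=1)
    per_sample = -lam * q_values + bc_term + uncertainty * expert_term
    return float(per_sample.mean())
```

In this sketch, well-covered states have low uncertainty, so the loss reduces to roughly the TD3+BC objective there, while rarely seen states pull the policy toward the expert's action. In practice the same expression would be written in an autodiff framework so that it can be backpropagated through the actor.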

Author Information

Richa Verma (TCS Research)
Kartik Bharadwaj (Indian Institute of Technology, Madras)

I am an MS CS student at IITM. I am interested in Safe RL and MARL.

Harshad Khadilkar (Tata Consultancy Services Ltd)

Scientist with TCS Research and Visiting Associate Professor at IIT Bombay. Educational background includes a bachelor's in engineering from IIT Bombay, followed by a master's and a PhD from MIT (2013).

Balaraman Ravindran (Indian Institute of Technology Madras)

More from the Same Authors