
MOPA: a Minimalist Off-Policy Approach to Safe-RL
Hao Sun · Ziping Xu · Zhenghao Peng · Meng Fang · Bo Dai · Bolei Zhou
Event URL: https://openreview.net/forum?id=L86zZ6ow_WE

Safety is one of the crucial concerns for the real-world application of reinforcement learning (RL). Previous works formulate the safe exploration problem as a Constrained Markov Decision Process (CMDP), where policies are optimized under constraints. However, when encountering potential danger, humans tend to stop immediately, and rarely learn to behave safely while in danger. Moreover, the off-policy nature of human learning enables high learning efficiency in risky tasks. Motivated by human learning, we introduce a Minimalist Off-Policy Approach (MOPA) to address the Safe-RL problem. We first define the Early Terminated MDP (ET-MDP) as a special class of MDPs that has the same optimal value function as its CMDP counterpart. We then propose MOPA, an off-policy learning algorithm based on recurrent models, to solve the ET-MDP and thereby the corresponding CMDP. Experiments on various Safe-RL tasks show a substantial improvement over previous methods that directly solve the CMDP, in terms of both higher asymptotic performance and better learning efficiency.

Author Information

Hao Sun (University of Cambridge)
Ziping Xu (University of Michigan)

My name is Ziping Xu. I am a fifth-year Ph.D. student in Statistics at the University of Michigan. My research interests are sample-efficient reinforcement learning, transfer learning, and multitask learning. I am looking for a research-oriented full-time position starting in Fall 2023.

Zhenghao Peng (University of California, Los Angeles)
Meng Fang (Tencent)
Bo Dai (Shanghai AI Lab)
Bolei Zhou (UCLA)

Assistant Professor in the Computer Science Department at UCLA
