
Policy Aware Model Learning via Transition Occupancy Matching
Jason Yecheng Ma · Kausik Sivakumar · Osbert Bastani · Dinesh Jayaraman

Model-based reinforcement learning (MBRL) is an effective paradigm for sample-efficient policy learning. The predominant MBRL strategy iteratively learns the dynamics model by performing maximum likelihood estimation (MLE) on the entire replay buffer and trains the policy using fictitious transitions from the learned model. Because not all transitions in the replay buffer are equally informative about the task or the policy's current progress, this MLE strategy cannot be optimal and bears no clear relation to the standard RL objective. In this work, we propose Transition Occupancy Matching (TOM), a policy-aware model learning algorithm that maximizes a lower bound on the standard RL objective. TOM learns a policy-aware dynamics model by minimizing an $f$-divergence between the distribution of transitions that the current policy visits in the real environment and in the learned model; the policy can then be updated using any pre-existing RL algorithm with a log-transformed reward. TOM's practical implementation builds on tools from dual reinforcement learning to learn the optimal transition occupancy ratio between the current policy and the replay buffer; using this ratio as importance weights, TOM amounts to performing MLE model learning on the correct, policy-aware transition distribution. Crucially, TOM is a model learning subroutine and is compatible with any backbone MBRL algorithm that implements MLE-based model learning. On the standard set of MuJoCo locomotion tasks, we find that TOM improves the learning speed of a standard MBRL algorithm and can reach the same asymptotic performance with up to 50% fewer samples.
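The importance-weighting step can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a toy scalar linear-Gaussian dynamics model $s' \sim \mathcal{N}(\alpha s + \beta u, \sigma^2)$ (with action $u$) in place of a learned model, and it takes the per-transition weights as given. In TOM those weights would come from the learned occupancy ratio between the current policy and the replay buffer, which this sketch does not estimate. The function name `weighted_gaussian_mle` and the toy data are hypothetical. For a linear-Gaussian model, maximizing the weighted log-likelihood $\sum_i w_i \log P_\theta(s'_i \mid s_i, u_i)$ reduces to weighted least squares.

```python
from typing import Sequence, Tuple


def weighted_gaussian_mle(
    states: Sequence[float],
    actions: Sequence[float],
    next_states: Sequence[float],
    weights: Sequence[float],
) -> Tuple[float, float, float]:
    """Fit s' ~ N(alpha*s + beta*u, sigma^2) by importance-weighted MLE.

    weights[i] plays the role of an occupancy ratio for transition i; here it is
    simply given (an assumption), not estimated. For this linear-Gaussian model,
    maximizing the weighted log-likelihood is weighted least squares.
    """
    # Weighted sufficient statistics for the 2x2 normal equations.
    sss = sum(w * s * s for w, s in zip(weights, states))
    suu = sum(w * u * u for w, u in zip(weights, actions))
    ssu = sum(w * s * u for w, s, u in zip(weights, states, actions))
    ssy = sum(w * s * y for w, s, y in zip(weights, states, next_states))
    suy = sum(w * u * y for w, u, y in zip(weights, actions, next_states))
    det = sss * suu - ssu * ssu
    if abs(det) < 1e-12:
        raise ValueError("weighted design matrix is singular")
    # Solve the normal equations with Cramer's rule.
    alpha = (ssy * suu - suy * ssu) / det
    beta = (suy * sss - ssy * ssu) / det
    # Weighted MLE of the noise variance: weighted mean squared residual.
    total_w = sum(weights)
    sq_resid = sum(
        w * (y - alpha * s - beta * u) ** 2
        for w, s, u, y in zip(weights, states, actions, next_states)
    )
    return alpha, beta, sq_resid / total_w


# Toy replay buffer from two dynamics regimes (hypothetical data):
# alpha = 0.9 when s > 0, alpha = 0.5 otherwise; beta = 0.5 in both.
states = [1.0, 2.0, 3.0, -1.0, -2.0, -3.0]
actions = [0.5, -1.0, 1.0, 0.5, -1.0, 1.0]
next_states = [
    (0.9 if s > 0 else 0.5) * s + 0.5 * u for s, u in zip(states, actions)
]

# Plain MLE on the whole buffer: every transition weighted equally.
uniform = weighted_gaussian_mle(states, actions, next_states, [1.0] * 6)
# Policy-aware weights: pretend the current policy only visits s > 0.
policy_aware = weighted_gaussian_mle(
    states, actions, next_states, [1.0 if s > 0 else 0.0 for s in states]
)
print("uniform:", uniform)
print("policy-aware:", policy_aware)
```

With weights concentrated on the transitions the policy visits (here, states with $s > 0$), the misspecified linear model recovers that region's dynamics ($\alpha = 0.9$, $\beta = 0.5$), whereas uniform weights, i.e. plain MLE on the whole buffer, average the two regimes and return $\alpha = 0.7$.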

#### Author Information

##### Kausik Sivakumar (University of Pennsylvania)

I'm a final-year Robotics Master's student at the University of Pennsylvania. I work with Prof. Dinesh Jayaraman and Prof. Osbert Bastani on reinforcement learning research. I'm excited about robot learning research and want to apply learning-based control to manipulation and mobile robots. I will be applying to PhD programs this cycle. If you find my profile interesting or want to connect, feel free to contact me at "kausik@seas.upenn.edu".

##### Dinesh Jayaraman (University of Pennsylvania)

I am an assistant professor at UPenn’s GRASP lab. I lead the Perception, Action, and Learning (PAL) Research Group, where we work on problems at the intersection of computer vision, machine learning, and robotics.