Policy Aware Model Learning via Transition Occupancy Matching
Jason Yecheng Ma · Kausik Sivakumar · Osbert Bastani · Dinesh Jayaraman
Event URL: https://openreview.net/forum?id=UW-VQXWuuWB
Model-based reinforcement learning (MBRL) is an effective paradigm for sample-efficient policy learning. The predominant MBRL strategy iteratively learns the dynamics model by performing maximum likelihood estimation (MLE) on the entire replay buffer and trains the policy using fictitious transitions from the learned model. Given that not all transitions in the replay buffer are equally informative about the task or the policy's current progress, this MLE strategy cannot be optimal and bears no clear relation to the standard RL objective. In this work, we propose Transition Occupancy Matching (TOM), a policy-aware model learning algorithm that maximizes a lower bound on the standard RL objective. TOM learns a policy-aware dynamics model by minimizing an $f$-divergence between the distributions of transitions that the current policy visits in the real environment and in the learned model; the policy can then be updated using any pre-existing RL algorithm with a log-transformed reward. TOM's practical implementation builds on tools from dual reinforcement learning and learns the optimal transition occupancy ratio between the current policy and the replay buffer; using this ratio as importance weights, TOM amounts to performing MLE model learning on the correct, policy-aware transition distribution. Crucially, TOM is a model learning subroutine and is compatible with any backbone MBRL algorithm that implements MLE-based model learning. On the standard suite of MuJoCo locomotion tasks, we find that TOM improves the learning speed of a standard MBRL algorithm and can reach the same asymptotic performance with as much as 50% fewer samples.
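To make the abstract's key idea concrete, below is a minimal sketch of the importance-weighted MLE model update that TOM reduces to, written in PyTorch. It assumes the transition occupancy ratio has already been estimated elsewhere (the abstract's dual-RL step is not shown), and all names here (`DynamicsModel`, `weighted_mle_step`, `ratio_weights`) are illustrative placeholders rather than the authors' reference implementation.

```python
import torch
import torch.nn as nn


class DynamicsModel(nn.Module):
    """Toy Gaussian dynamics model p(s' | s, a) with fixed unit variance (a simplifying assumption)."""

    def __init__(self, state_dim: int, action_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, state_dim),
        )

    def log_prob(self, state, action, next_state):
        # Predict the next-state mean and score the observed transition.
        mean = self.net(torch.cat([state, action], dim=-1))
        # Unit-variance Gaussian log-likelihood, up to an additive constant.
        return -0.5 * ((next_state - mean) ** 2).sum(dim=-1)


def weighted_mle_step(model, optimizer, batch, ratio_weights):
    """One importance-weighted MLE step on a batch of replay-buffer transitions.

    ratio_weights is an estimate of d^pi(s, a, s') / d^buffer(s, a, s'),
    i.e. the transition occupancy ratio between the current policy and the buffer.
    """
    state, action, next_state = batch
    log_likelihood = model.log_prob(state, action, next_state)
    # Re-weighted negative log-likelihood; ratio_weights = 1 recovers ordinary MLE.
    loss = -(ratio_weights.detach() * log_likelihood).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Setting `ratio_weights` to all ones recovers the standard whole-buffer MLE update; the abstract's point is that TOM instead estimates these weights (via its dual-RL formulation) so that the model is fit on the transition distribution the current policy actually visits.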
Author Information
Jason Yecheng Ma (University of Pennsylvania)
Kausik Sivakumar (University of Pennsylvania)
I'm a final-year Robotics Master's student at the University of Pennsylvania. I work with Prof. Dinesh Jayaraman and Prof. Osbert Bastani on reinforcement learning research. I'm excited about robot learning, and I want to apply learning-based control to manipulation and mobile robots. I will be applying to PhD programs this cycle. If you find my profile interesting or want to connect, feel free to contact me at kausik@seas.upenn.edu.
Osbert Bastani (University of Pennsylvania)
Dinesh Jayaraman (University of Pennsylvania)
I am an assistant professor at UPenn’s GRASP lab. I lead the Perception, Action, and Learning (PAL) Research Group, where we work on problems at the intersection of computer vision, machine learning, and robotics.
More from the Same Authors
- 2020 : Paper 50: Diverse Sampling for Flow-Based Trajectory Forecasting
  Jason Yecheng Ma · Jeevana Priya Inala · Dinesh Jayaraman · Osbert Bastani
- 2021 Spotlight: Program Synthesis Guided Reinforcement Learning for Partially Observed Environments
  Yichen Yang · Jeevana Priya Inala · Osbert Bastani · Yewen Pu · Armando Solar-Lezama · Martin Rinard
- 2021 : Conservative and Adaptive Penalty for Model-Based Safe Reinforcement Learning
  Jason Yecheng Ma · Andrew Shen · Osbert Bastani · Dinesh Jayaraman
- 2021 : Specification-Guided Learning of Nash Equilibria with High Social Welfare
  Kishor Jothimurugan · Suguman Bansal · Osbert Bastani · Rajeev Alur
- 2021 : PAC Synthesis of Machine Learning Programs
  Osbert Bastani
- 2021 : Synthesizing Video Trajectory Queries
  Stephen Mell · Favyen Bastani · Stephan Zdancewic · Osbert Bastani
- 2021 : Object Representations Guided By Optical Flow
  Jianing Qian · Dinesh Jayaraman
- 2021 : Improving Human Decision-Making with Machine Learning
  Hamsa Bastani · Osbert Bastani · Park Sinchaisri
- 2022 : Bandits for Online Calibration: An Application to Content Moderation on Social Media Platforms
  Vashist Avadhanula · Omar Abdul Baki · Hamsa Bastani · Osbert Bastani · Caner Gocmen · Daniel Haimovich · Darren Hwang · Dmytro Karamshuk · Thomas Leeper · Jiayuan Ma · Gregory macnamara · Jake Mullet · Christopher Palow · Sung Park · Varun S Rajagopal · Kevin Schaeffer · Parikshit Shah · Deeksha Sinha · Nicolas Stier-Moses · Ben Xu
- 2022 : Towards Universal Visual Reward and Representation via Value-Implicit Pre-Training
  Jason Yecheng Ma · Shagun Sodhani · Dinesh Jayaraman · Osbert Bastani · Vikash Kumar · Amy Zhang
- 2022 : VIP: Towards Universal Visual Reward and Representation via Value-Implicit Pre-Training
  Jason Yecheng Ma · Shagun Sodhani · Dinesh Jayaraman · Osbert Bastani · Vikash Kumar · Amy Zhang
- 2022 : Learning a Meta-Controller for Dynamic Grasping
  Yinsen Jia · Jingxi Xu · Dinesh Jayaraman · Shuran Song
- 2022 : Robust Option Learning for Adversarial Generalization
  Kishor Jothimurugan · Steve Hsu · Osbert Bastani · Rajeev Alur
- 2022 Poster: PAC Prediction Sets for Meta-Learning
  Sangdon Park · Edgar Dobriban · Insup Lee · Osbert Bastani
- 2022 Poster: Offline Goal-Conditioned Reinforcement Learning via $f$-Advantage Regression
  Jason Yecheng Ma · Jason Yan · Dinesh Jayaraman · Osbert Bastani
- 2022 Poster: Neurosymbolic Deep Generative Models for Sequence Data with Relational Constraints
  Halley Young · Maxwell Du · Osbert Bastani
- 2022 Poster: Regret Bounds for Risk-Sensitive Reinforcement Learning
  Osbert Bastani · Jason Yecheng Ma · Estelle Shen · Wanqiao Xu
- 2022 Poster: Practical Adversarial Multivalid Conformal Prediction
  Osbert Bastani · Varun Gupta · Christopher Jung · Georgy Noarov · Ramya Ramalingam · Aaron Roth
- 2021 Poster: Conservative Offline Distributional Reinforcement Learning
  Jason Yecheng Ma · Dinesh Jayaraman · Osbert Bastani
- 2021 Poster: Compositional Reinforcement Learning from Logical Specifications
  Kishor Jothimurugan · Suguman Bansal · Osbert Bastani · Rajeev Alur
- 2021 Poster: Program Synthesis Guided Reinforcement Learning for Partially Observed Environments
  Yichen Yang · Jeevana Priya Inala · Osbert Bastani · Yewen Pu · Armando Solar-Lezama · Martin Rinard
- 2021 Poster: Learning Models for Actionable Recourse
  Alexis Ross · Himabindu Lakkaraju · Osbert Bastani
- 2020 Poster: Neurosymbolic Transformers for Multi-Agent Communication
  Jeevana Priya Inala · Yichen Yang · James Paulos · Yewen Pu · Osbert Bastani · Vijay Kumar · Martin Rinard · Armando Solar-Lezama
- 2020 Session: Orals & Spotlights Track 18: Deep Learning
  Yale Song · Dinesh Jayaraman
- 2020 Poster: Long-Horizon Visual Planning with Goal-Conditioned Hierarchical Predictors
  Karl Pertsch · Oleh Rybkin · Frederik Ebert · Shenghao Zhou · Dinesh Jayaraman · Chelsea Finn · Sergey Levine
- 2020 Poster: Fighting Copycat Agents in Behavioral Cloning from Observation Histories
  Chuan Wen · Jierui Lin · Trevor Darrell · Dinesh Jayaraman · Yang Gao
- 2019 : Poster Session
  Matthia Sabatelli · Adam Stooke · Amir Abdi · Paulo Rauber · Leonard Adolphs · Ian Osband · Hardik Meisheri · Karol Kurach · Johannes Ackermann · Matt Benatan · GUO ZHANG · Chen Tessler · Dinghan Shen · Mikayel Samvelyan · Riashat Islam · Murtaza Dalal · Luke Harries · Andrey Kurenkov · Konrad Żołna · Sudeep Dasari · Kristian Hartikainen · Ofir Nachum · Kimin Lee · Markus Holzleitner · Vu Nguyen · Francis Song · Christopher Grimm · Felipe Leno da Silva · Yuping Luo · Yifan Wu · Alex Lee · Thomas Paine · Wei-Yang Qu · Daniel Graves · Yannis Flet-Berliac · Yunhao Tang · Suraj Nair · Matthew Hausknecht · Akhil Bagaria · Simon Schmitt · Bowen Baker · Paavo Parmas · Benjamin Eysenbach · Lisa Lee · Siyu Lin · Daniel Seita · Abhishek Gupta · Riley Simmons-Edler · Yijie Guo · Kevin Corder · Vikash Kumar · Scott Fujimoto · Adam Lerer · Ignasi Clavera Gilaberte · Nicholas Rhinehart · Ashvin Nair · Ge Yang · Lingxiao Wang · Sungryull Sohn · J. Fernando Hernandez-Garcia · Xian Yeow Lee · Rupesh Srivastava · Khimya Khetarpal · Chenjun Xiao · Luckeciano Carvalho Melo · Rishabh Agarwal · Tianhe Yu · Glen Berseth · Devendra Singh Chaplot · Jie Tang · Anirudh Srinivasan · Tharun Kumar Reddy Medini · Aaron Havens · Misha Laskin · Asier Mujika · Rohan Saphal · Joseph Marino · Alex Ray · Joshua Achiam · Ajay Mandlekar · Zhuang Liu · Danijar Hafner · Zhiwen Tang · Ted Xiao · Michael Walton · Jeff Druce · Ferran Alet · Zhang-Wei Hong · Stephanie Chan · Anusha Nagabandi · Hao Liu · Hao Sun · Ge Liu · Dinesh Jayaraman · John Co-Reyes · Sophia Sanborn
- 2019 Poster: A Composable Specification Language for Reinforcement Learning Tasks
  Kishor Jothimurugan · Rajeev Alur · Osbert Bastani
- 2018 Poster: Verifiable Reinforcement Learning via Policy Extraction
  Osbert Bastani · Yewen Pu · Armando Solar-Lezama