The problem of offline reinforcement learning focuses on learning a good policy from a log of environment interactions. Past efforts to develop algorithms in this area have revolved around adding constraints to online reinforcement learning algorithms so that the learned policy's actions stay close to the logged data. In this work, we explore an alternative approach: planning on the fixed dataset directly. Specifically, we introduce an algorithm that forms a tabular Markov Decision Process (MDP) over the logged data by adding new transitions to the dataset. We do this by using learned dynamics models to plan short trajectories between states. Since exact value iteration can be performed on this constructed MDP, it becomes easy to identify which trajectories are advantageous to add to the MDP. Crucially, since most transitions in this MDP come from the logged data, trajectories from the MDP can be rolled out for long periods with confidence. We prove that this property allows one to bound the value function from above and below, up to appropriate distance metrics. Finally, we demonstrate empirically how algorithms that uniformly constrain the learned policy to the entire dataset can result in unwanted behavior, and we show an example in which simply behavior-cloning the optimal policy of the MDP created by our algorithm avoids this problem.
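The construction described above can be illustrated with a small, hypothetical sketch: logged (state, action, reward, next state) tuples are treated as deterministic edges of a tabular MDP, any stitching transitions proposed by a learned dynamics model are added to the same table, and exact value iteration then recovers an optimal policy over that table. The names and data layout below are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch: build a tabular MDP from logged transitions and run
# exact value iteration on it. Structure and names are illustrative only.
from collections import defaultdict


def value_iteration(transitions, gamma=0.99, tol=1e-6):
    """transitions: dict mapping (state, action) -> (reward, next_state).

    Each logged (s, a, r, s') tuple is one deterministic edge; planned
    "stitching" transitions from a learned dynamics model would be added
    to the same dictionary before calling this function.
    """
    states = {s for (s, _) in transitions} | {ns for (_, ns) in transitions.values()}
    values = {s: 0.0 for s in states}
    actions = defaultdict(list)  # state -> list of (action, reward, next_state)
    for (s, a), (r, ns) in transitions.items():
        actions[s].append((a, r, ns))

    while True:
        delta = 0.0
        for s in states:
            if not actions[s]:
                continue  # absorbing state in the constructed MDP; value stays 0
            new_v = max(r + gamma * values[ns] for (_, r, ns) in actions[s])
            delta = max(delta, abs(new_v - values[s]))
            values[s] = new_v
        if delta < tol:
            break

    # Greedy policy over the tabular MDP; behavior cloning this policy is one
    # way to deploy it, as discussed in the abstract.
    policy = {s: max(actions[s], key=lambda t: t[1] + gamma * values[t[2]])[0]
              for s in states if actions[s]}
    return values, policy
```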
Author Information
Ian Char (Carnegie Mellon University)
Viraj Mehta (Carnegie Mellon University)
Adam Villaflor (Carnegie Mellon University)
John Dolan (Carnegie Mellon University)
Jeff Schneider (Carnegie Mellon University)
More from the Same Authors
- 2021 : Parametric-Control Barrier Function-based Adaptive Safe Merging Control for Heterogeneous Vehicles »
  Yiwei Lyu · John Dolan
- 2022 : Offline Model-Based Reinforcement Learning for Tokamak Control »
  Ian Char · Joseph Abbate · Laszlo Bardoczi · Mark Boyer · Youngseog Chung · Rory Conlin · Keith Erickson · Viraj Mehta · Nathan Richner · Egemen Kolemen · Jeff Schneider
- 2022 Poster: Exploration via Planning for Information about the Optimal Trajectory »
  Viraj Mehta · Ian Char · Joseph Abbate · Rory Conlin · Mark Boyer · Stefano Ermon · Jeff Schneider · Willie Neiswanger
- 2021 : Bayesian Active Reinforcement Learning »
  Viraj Mehta · Biswajit Paria · Jeff Schneider · Willie Neiswanger
- 2021 Poster: Beyond Pinball Loss: Quantile Methods for Calibrated Uncertainty Quantification »
  Youngseog Chung · Willie Neiswanger · Ian Char · Jeff Schneider