We study model-based offline reinforcement learning with general function approximation, without a full coverage assumption on the offline data distribution. We present an algorithm named Constrained Pessimistic Policy Optimization (CPPO), which leverages a general function class and uses a constraint over the models to encode pessimism. Under the assumption that the ground-truth model belongs to our function class (i.e., realizability in the function class), CPPO has a PAC guarantee when the offline data provides only partial coverage, i.e., it can learn a policy that competes against any policy covered by the offline data. We then demonstrate that this algorithmic framework can be applied to many specialized Markov Decision Processes where additional structural assumptions further refine the concept of partial coverage. Two notable examples are: (1) low-rank MDPs with representation learning, where the partial coverage condition is defined via a relative condition number measured under the unknown ground-truth feature representation; (2) factored MDPs, where the partial coverage condition is defined via density-ratio-based concentrability coefficients associated with individual factors.
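As a minimal sketch of the "constraint over the models to encode pessimism" described above, the objective can be written as a max-min program over a version space of models consistent with the offline data; the notation here (model class P, dataset D, value V of a policy under a model, and slack xi) is introduced for illustration and is not fixed by the abstract:

% Hedged sketch: pessimistic max-min policy optimization over an
% MLE-based version space. All symbols below are illustrative.
\[
\widehat{\pi} \in \operatorname*{arg\,max}_{\pi \in \Pi}\;
\min_{P \in \mathcal{M}_{\mathcal{D}}} V^{\pi}_{P},
\qquad
\mathcal{M}_{\mathcal{D}} =
\Bigl\{ P \in \mathcal{P} \;:\;
\sum_{(s,a,s') \in \mathcal{D}} \log P(s' \mid s, a)
\;\ge\;
\max_{\tilde{P} \in \mathcal{P}}
\sum_{(s,a,s') \in \mathcal{D}} \log \tilde{P}(s' \mid s, a) - \xi
\Bigr\}.
\]

Pessimism enters because each candidate policy is evaluated under the worst model that remains statistically consistent with the data, so the learned policy is only certified on behaviors the offline distribution actually covers.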
Author Information
Masatoshi Uehara (Cornell University)
Wen Sun (Cornell University)
More from the Same Authors
- 2022: Hybrid RL: Using both offline and online data can make RL efficient
  Yuda Song · Yifei Zhou · Ayush Sekhari · J. Bagnell · Akshay Krishnamurthy · Wen Sun
- 2022: Provable Benefits of Representational Transfer in Reinforcement Learning
  Alekh Agarwal · Yuda Song · Kaiwen Wang · Mengdi Wang · Wen Sun · Xuezhou Zhang
- 2023 Poster: Offline Minimax Soft-Q-learning Under Realizability and Partial Coverage
  Masatoshi Uehara · Nathan Kallus · Jason Lee · Wen Sun
- 2023 Poster: Teaching Cars to See in a Day: Unsupervised Object Discovery with Reward Fine-tuning
  Katie Luo · Zhenzhen Liu · Xiangyu Chen · Yurong You · Sagie Benaim · Cheng Perng Phoo · Mark Campbell · Wen Sun · Bharath Hariharan · Kilian Weinberger
- 2023 Poster: Contextual Bandits and Imitation Learning with Preference-Based Active Queries
  Ayush Sekhari · Karthik Sridharan · Wen Sun · Runzhe Wu
- 2023 Poster: The Benefits of Being Distributional: Small-Loss Bounds for Reinforcement Learning
  Kaiwen Wang · Kevin Zhou · Runzhe Wu · Nathan Kallus · Wen Sun
- 2023 Poster: Selective Sampling and Imitation Learning via Online Regression
  Ayush Sekhari · Karthik Sridharan · Wen Sun · Runzhe Wu
- 2023 Poster: Future-Dependent Value-Based Off-Policy Evaluation in POMDPs
  Masatoshi Uehara · Haruka Kiyohara · Andrew Bennett · Victor Chernozhukov · Nan Jiang · Nathan Kallus · Chengchun Shi · Wen Sun
- 2022 Poster: Provably Efficient Reinforcement Learning in Partially Observable Dynamical Systems
  Masatoshi Uehara · Ayush Sekhari · Jason Lee · Nathan Kallus · Wen Sun
- 2021: Representation Learning for Online and Offline RL in Low-rank MDPs
  Masatoshi Uehara · Xuezhou Zhang · Wen Sun
- 2021 Workshop: Causal Inference Challenges in Sequential Decision Making: Bridging Theory and Practice
  Aurelien Bibaut · Maria Dimakopoulou · Nathan Kallus · Xinkun Nie · Masatoshi Uehara · Kelly Zhang
- 2021: Q/A Session
  Wen Sun · Shixia Liu
- 2021: Speaker Introduction
  Wen Sun
- 2021 Workshop: eXplainable AI approaches for debugging and diagnosis
  Roberto Capobianco · Biagio La Rosa · Leilani Gilpin · Wen Sun · Alice Xiang · Alexander Feldman
- 2021 Poster: Mitigating Covariate Shift in Imitation Learning via Offline Data With Partial Coverage
  Jonathan Chang · Masatoshi Uehara · Dhruv Sreenivas · Rahul Kidambi · Wen Sun
- 2020 Poster: Off-Policy Evaluation and Learning for External Validity under a Covariate Shift
  Masatoshi Uehara · Masahiro Kato · Shota Yasui
- 2020 Spotlight: Off-Policy Evaluation and Learning for External Validity under a Covariate Shift
  Masatoshi Uehara · Masahiro Kato · Shota Yasui
- 2020 Poster: Doubly Robust Off-Policy Value and Gradient Estimation for Deterministic Policies
  Nathan Kallus · Masatoshi Uehara