Timezone: »
Poster
Towards InstanceOptimal Offline Reinforcement Learning with Pessimism
Ming Yin · YuXiang Wang
We study the \emph{offline reinforcement learning} (offline RL) problem, where the goal is to learn a rewardmaximizing policy in an unknown \emph{Markov Decision Process} (MDP) using the data coming from a policy $\mu$. In particular, we consider the sample complexity problems of offline RL for the finite horizon MDPs. Prior works derive the informationtheoretical lower bounds based on different datacoverage assumptions and their upper bounds are expressed by the covering coefficients which lack the explicit characterization of system quantities. In this work, we analyze the \emph{Adaptive Pessimistic Value Iteration} (APVI) algorithm and derive the suboptimality upper bound that nearly matches\[O\left(\sum_{h=1}^H\sum_{s_h,a_h}d^{\pi^\star}_h(s_h,a_h)\sqrt{\frac{\mathrm{Var}_{P_{s_h,a_h}}{(V^\star_{h+1}+r_h)}}{d^\mu_h(s_h,a_h)}}\sqrt{\frac{1}{n}}\right).\]We also prove an informationtheoretical lower bound to show this quantity is required under the weak assumption that $d^\mu_h(s_h,a_h)>0$ if $d^{\pi^\star}_h(s_h,a_h)>0$. Here $\pi^\star$ is a optimal policy, $\mu$ is the behavior policy and $d(s_h,a_h)$ is the marginal stateaction probability. We call this adaptive bound the \emph{intrinsic offline reinforcement learning bound} since it directly implies all the existing optimal results: minimax rate under uniform datacoverage assumption, horizonfree setting, single policy concentrability, and the tight problemdependent results. Later, we extend the result to the \emph{assumptionfree} regime (where we make no assumption on $\mu$) and obtain the assumptionfree intrinsic bound. Due to its generic form, we believe the intrinsic bound could help illuminate what makes a specific problem hard and reveal the fundamental challenges in offline RL.
Author Information
Ming Yin (UC Santa Barbara)
YuXiang Wang (UC Santa Barbara)
More from the Same Authors

2021 Spotlight: Logarithmic Regret in Featurebased Dynamic Pricing »
Jianyu Xu · YuXiang Wang 
2021 : Instancedependent Offline Reinforcement Learning: From tabular RL to linear MDPs »
Ming Yin · YuXiang Wang 
2021 Poster: Privately Publishable Perinstance Privacy »
Rachel Redberg · YuXiang Wang 
2021 Poster: Logarithmic Regret in Featurebased Dynamic Pricing »
Jianyu Xu · YuXiang Wang 
2021 Poster: Optimal Uniform OPE and Modelbased Offline Reinforcement Learning in TimeHomogeneous, RewardFree and TaskAgnostic Settings »
Ming Yin · YuXiang Wang 
2021 Poster: NearOptimal Offline Reinforcement Learning via Double Variance Reduction »
Ming Yin · Yu Bai · YuXiang Wang 
2017 Poster: HigherOrder Total Variation Classes on Grids: Minimax Theory and Trend Filtering Methods »
Veeranjaneyulu Sadhanala · YuXiang Wang · James Sharpnack · Ryan Tibshirani 
2016 : Optimal and Adaptive Offpolicy Evaluation in Contextual Bandits »
YuXiang Wang 
2016 Poster: Total Variation Classes Beyond 1d: Minimax Rates, and the Limitations of Linear Smoothers »
Veeranjaneyulu Sadhanala · YuXiang Wang · Ryan Tibshirani 
2015 : YuXiang Wang: Learning with differential privacy: stability, learnability and the sufficiency and necessity of ERM principle »
YuXiang Wang 
2015 Poster: Differentially private subspace clustering »
Yining Wang · YuXiang Wang · Aarti Singh