Poster
Sample-Efficient Reinforcement Learning for Linearly-Parameterized MDPs with a Generative Model
Bingyan Wang · Yuling Yan · Jianqing Fan
The curse of dimensionality is a widely known issue in reinforcement learning (RL). In the tabular setting where the state space $\mathcal{S}$ and the action space $\mathcal{A}$ are both finite, to obtain a near-optimal policy with sampling access to a generative model, the minimax-optimal sample complexity scales linearly with $|\mathcal{S}|\times|\mathcal{A}|$, which can be prohibitively large when $\mathcal{S}$ or $\mathcal{A}$ is large. This paper considers a Markov decision process (MDP) that admits a set of state-action features, which can linearly express (or approximate) its probability transition kernel. We show that a model-based approach (resp.~Q-learning) provably learns an $\varepsilon$-optimal policy (resp.~Q-function) with high probability as soon as the sample size exceeds the order of $\frac{K}{(1-\gamma)^{3}\varepsilon^{2}}$ (resp.~$\frac{K}{(1-\gamma)^{4}\varepsilon^{2}}$), up to some logarithmic factor. Here $K$ is the feature dimension and $\gamma\in(0,1)$ is the discount factor of the MDP. Both sample complexity bounds are provably tight, and our result for the model-based approach matches the minimax lower bound. Our results show that for arbitrarily large-scale MDPs, both the model-based approach and Q-learning are sample-efficient when $K$ is relatively small, hence the title of this paper.
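As a brief illustration of the setting described in the abstract (a sketch of one standard formalization; the feature symbols $\phi_k$, $\psi_k$ below are assumed notation, and the paper's exact conditions may differ), a transition kernel that is linearly expressed by $K$ state-action features can be written as:

```latex
% Hedged sketch (assumed notation): a linearly-parameterized
% transition kernel with K state-action features phi_k and
% K unknown component functions psi_k over next states.
P(s' \mid s, a) \;=\; \sum_{k=1}^{K} \phi_k(s, a)\, \psi_k(s'),
\qquad \phi(s, a) = \bigl(\phi_1(s,a), \dots, \phi_K(s,a)\bigr) \in \mathbb{R}^{K}.
```

Under such a parameterization, only the $K$ unknown components need to be estimated rather than the full $|\mathcal{S}|\times|\mathcal{A}|$ table of transition probabilities, which is consistent with the sample complexity bounds above scaling with $K$ instead of $|\mathcal{S}|\times|\mathcal{A}|$.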
Author Information
Bingyan Wang (Princeton University)
Yuling Yan (Princeton University)
Jianqing Fan (Princeton University)
More from the Same Authors

2021 Poster: Curriculum Learning for Vision-and-Language Navigation »
Jiwen Zhang · Zhongyu Wei · Jianqing Fan · Jiajie Peng
2020 Poster: Efficient Clustering for Stretched Mixtures: Landscape and Optimality »
Kaizheng Wang · Yuling Yan · Mateo Diaz