Optimistic posterior sampling for reinforcement learning: worst-case regret bounds
Shipra Agrawal · Randy Jia

Tue Dec 05 06:30 PM -- 10:30 PM (PST) @ Pacific Ballroom #1
We present an algorithm based on posterior sampling (aka Thompson sampling) that achieves near-optimal worst-case regret bounds when the underlying Markov Decision Process (MDP) is communicating with a finite, though unknown, diameter. Our main result is a high probability regret upper bound of $\tilde{O}(D\sqrt{SAT})$ for any communicating MDP with $S$ states, $A$ actions and diameter $D$, when $T\ge S^5A$. Here, regret compares the total reward achieved by the algorithm to the total expected reward of an optimal infinite-horizon undiscounted average reward policy, in time horizon $T$. This result improves over the best previously known upper bound of $\tilde{O}(DS\sqrt{AT})$ achieved by any algorithm in this setting, and matches the dependence on $S$ in the established lower bound of $\Omega(\sqrt{DSAT})$ for this problem. Our techniques involve proving some novel results about the anti-concentration of Dirichlet distribution, which may be of independent interest.
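To make the abstract concrete, here is a minimal sketch of posterior (Thompson) sampling for a tabular MDP, assuming known deterministic rewards and a Dirichlet prior over each row of the transition kernel. The paper analyzes the undiscounted average-reward setting and resamples in epochs; for simplicity this illustrative sketch resamples every step and solves the sampled MDP by discounted value iteration. All function and variable names are hypothetical, not from the paper.

```python
import numpy as np

def posterior_sampling_rl(true_P, R, T=2000, gamma=0.95, seed=0):
    """One run of posterior (Thompson) sampling on a tabular MDP.

    Illustrative simplifications vs. the paper: discounted value
    iteration instead of average-reward planning, and per-step
    resampling instead of epochs. true_P[s, a] is the unknown
    transition distribution; R[s, a] is a known reward.
    """
    rng = np.random.default_rng(seed)
    S, A = R.shape
    counts = np.ones((S, A, S))  # Dirichlet(1, ..., 1) prior per (s, a)
    s, total = 0, 0.0
    for t in range(T):
        # Sample a plausible MDP from the Dirichlet posterior.
        P = np.array([[rng.dirichlet(counts[i, a]) for a in range(A)]
                      for i in range(S)])
        # Plan optimally in the sampled MDP (optimism via sampling).
        V = np.zeros(S)
        for _ in range(100):
            Q = R + gamma * P @ V      # Q[s, a] under sampled P
            V = Q.max(axis=1)
        a = int(Q[s].argmax())
        # Act in the real MDP and update the posterior counts.
        s_next = rng.choice(S, p=true_P[s, a])
        total += R[s, a]
        counts[s, a, s_next] += 1
        s = s_next
    return total / T  # empirical average reward
```

A quick usage example on a two-state toy MDP where action 1 in state 1 is rewarding: `posterior_sampling_rl(true_P, R, T=300)` returns an empirical average reward in `[0, 1]`, which rises toward the optimal average reward as the posterior concentrates.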

Author Information

Shipra Agrawal (Columbia University)
Randy Jia (Columbia University)