Poster
Variational Bayesian Reinforcement Learning with Regret Bounds
Brendan O'Donoghue
We consider the exploration-exploitation tradeoff in reinforcement learning and show that an agent endowed with an exponential epistemic-risk-seeking utility function explores efficiently, as measured by regret. The state-action values induced by the exponential utility satisfy a Bellman recursion, so we can use dynamic programming to compute them. We call the resulting algorithm K-learning (for knowledge) and the risk-seeking utility ensures that the associated state-action values (K-values) are optimistic for the expected optimal Q-values under the posterior. The exponential utility function induces a Boltzmann exploration policy for which the 'temperature' parameter is equal to the risk-seeking parameter and is carefully controlled to yield a Bayes regret bound of $\tilde O(L^{3/2} \sqrt{S A T})$, where $L$ is the time horizon, $S$ is the number of states, $A$ is the number of actions, and $T$ is the total number of elapsed timesteps. We conclude with a numerical example demonstrating that K-learning is competitive with other state-of-the-art algorithms in practice.
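The abstract's description — K-values computed by backward dynamic programming under an exponential utility, with a Boltzmann policy whose temperature equals the risk-seeking parameter — can be sketched in code. This is a minimal illustrative sketch, not the paper's implementation: the function name `k_learning_sketch` is hypothetical, and the variance-bonus form of the backup is an assumption standing in for the paper's exact posterior-based recursion.

```python
import numpy as np

def k_learning_sketch(mu_r, var_r, P, L, tau):
    """Hypothetical sketch of a K-learning backward recursion.

    mu_r, var_r: assumed posterior mean/variance of rewards, shape (S, A)
    P: known transition probabilities, shape (S, A, S)
    L: time horizon; tau: risk-seeking 'temperature' parameter.
    """
    S, A = mu_r.shape
    K = np.zeros((L + 1, S, A))   # K-values; K[L] = 0 at the horizon
    pi = np.zeros((L, S, A))      # Boltzmann exploration policy
    for h in range(L - 1, -1, -1):
        # Exponential-utility (log-sum-exp) value of the next step, per state.
        V_next = tau * np.log(np.exp(K[h + 1] / tau).sum(axis=1))  # shape (S,)
        # Bellman backup: posterior mean reward plus a risk-seeking
        # certainty-equivalent bonus (the var/(2*tau) form is an assumption
        # for illustration), plus the expected next-step soft value.
        K[h] = mu_r + var_r / (2.0 * tau) + P @ V_next
        # Boltzmann policy: pi(a|s) proportional to exp(K(s,a)/tau),
        # with temperature equal to the risk-seeking parameter tau.
        logits = K[h] / tau
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        pi[h] = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    return K, pi
```

Because the utility is risk-seeking in the epistemic uncertainty, the bonus term grows with posterior variance, so poorly-known actions receive higher K-values and more probability mass under the Boltzmann policy.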
Author Information
Brendan O'Donoghue (DeepMind)
More from the Same Authors

2021 Spotlight: Variational Bayesian Optimistic Sampling »
Brendan O'Donoghue · Tor Lattimore 
2021 Spotlight: Reward is enough for convex MDPs »
Tom Zahavy · Brendan O'Donoghue · Guillaume Desjardins · Satinder Singh 
2021 Poster: Reward is enough for convex MDPs »
Tom Zahavy · Brendan O'Donoghue · Guillaume Desjardins · Satinder Singh 
2021 Poster: Practical Large-Scale Linear Programming using Primal-Dual Hybrid Gradient »
David Applegate · Mateo Diaz · Oliver Hinder · Haihao Lu · Miles Lubin · Brendan O'Donoghue · Warren Schudy 
2021 Poster: Variational Bayesian Optimistic Sampling »
Brendan O'Donoghue · Tor Lattimore 
2019 Poster: Hamiltonian descent for composite objectives »
Brendan O'Donoghue · Chris Maddison