Poster
Society of Agents: Regret Bounds of Concurrent Thompson Sampling
Yan Chen · Perry Dong · Qinxun Bai · Maria Dimakopoulou · Wei Xu · Zhengyuan Zhou
We consider the concurrent reinforcement learning problem where $n$ agents simultaneously learn to make decisions in the same environment by sharing experience with each other. Existing works in this emerging area have empirically demonstrated that Thompson sampling (TS) based algorithms provide a particularly attractive alternative for inducing cooperation, because each agent can independently sample a belief environment (and compute a corresponding optimal policy) from the joint posterior computed by aggregating all agents' data, which induces diversity in exploration among agents while benefiting from the shared experience of all agents. However, theoretical guarantees in this area remain underexplored; in particular, no regret bound is known for TS based concurrent RL algorithms. In this paper, we fill this gap by considering two settings. In the first, we study the simple finite-horizon episodic RL setting, where TS is naturally adapted into the concurrent setup by having each agent sample from the current joint posterior at the beginning of each episode. We establish a $\tilde{O}(HS\sqrt{\frac{AT}{n}})$ per-agent regret bound, where $H$ is the horizon of the episode, $S$ is the number of states, $A$ is the number of actions, $T$ is the number of episodes and $n$ is the number of agents. In the second setting, we consider the infinite-horizon RL problem, where a policy is measured by its long-run average reward. Here, despite not having natural episodic breakpoints, we show that by a doubling-horizon schedule, we can adapt TS to the infinite-horizon concurrent learning setting to achieve a regret bound of $\tilde{O}(DS\sqrt{ATn})$, where $D$ is the standard notion of diameter of the underlying MDP and $T$ is the number of timesteps. Note that in both settings, the per-agent regret decreases at an optimal rate of $\Theta(\frac{1}{\sqrt{n}})$, which demonstrates the power of cooperation in concurrent RL.
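The episodic scheme described in the abstract can be illustrated with a minimal tabular sketch. This is not the authors' implementation: the Dirichlet prior over transitions, the known reward table `R`, the fixed start state `0`, and all function names (`sample_dirichlet`, `plan`, `concurrent_ts`) are assumptions made for illustration only. At the start of each episode every agent draws its own MDP from the shared posterior and plans against it; at the end of the episode all agents' transitions are pooled into the joint posterior.

```python
import random

def sample_dirichlet(alpha, rng):
    # Draw from Dirichlet(alpha) by normalizing independent Gamma variates.
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def plan(P, R, H):
    # Finite-horizon value iteration; policy[h][s] is the greedy action.
    S, A = len(R), len(R[0])
    V = [0.0] * S
    policy = [None] * H
    for h in reversed(range(H)):
        Q = [[R[s][a] + sum(P[s][a][t] * V[t] for t in range(S))
              for a in range(A)] for s in range(S)]
        policy[h] = [max(range(A), key=lambda a: Q[s][a]) for s in range(S)]
        V = [max(Q[s]) for s in range(S)]
    return policy

def concurrent_ts(true_P, R, H, n_agents, n_episodes, seed=0):
    # Assumptions (not from the source): rewards R are known, transitions
    # have a Dirichlet(1, ..., 1) prior, every episode starts in state 0.
    rng = random.Random(seed)
    S, A = len(R), len(R[0])
    counts = [[[1.0] * S for _ in range(A)] for _ in range(S)]
    total_reward = 0.0
    for _ in range(n_episodes):
        batch = []
        for _ in range(n_agents):
            # Each agent samples its own belief MDP from the joint posterior.
            P_hat = [[sample_dirichlet(counts[s][a], rng) for a in range(A)]
                     for s in range(S)]
            policy = plan(P_hat, R, H)
            s = 0
            for h in range(H):
                a = policy[h][s]
                s_next = rng.choices(range(S), weights=true_P[s][a])[0]
                total_reward += R[s][a]
                batch.append((s, a, s_next))
                s = s_next
        # Pool all agents' transitions into the shared posterior.
        for s, a, s_next in batch:
            counts[s][a][s_next] += 1.0
    return counts, total_reward
```

Because the posterior is updated only between episodes, the agents within an episode act on independent samples from the same belief, which is the source of the exploration diversity the abstract mentions.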
Author Information
Yan Chen (Duke University)
Perry Dong (University of California, Berkeley)
Qinxun Bai (Horizon Robotics)
Maria Dimakopoulou (Netflix)
Wei Xu (Horizon Robotics)
Zhengyuan Zhou (Arena Technologies & NYU)
More from the Same Authors

2022 : Offline Reinforcement Learning with Closed-Form Policy Improvement Operators »
Jiachen Li · Edwin Zhang · Ming Yin · Qinxun Bai · Yu-Xiang Wang · William Yang Wang 
2022 : Off-policy Reinforcement Learning with Optimistic Exploration and Distribution Correction »
Jiachen Li · Shuo Cheng · Zhenyu Liao · Huayan Wang · William Yang Wang · Qinxun Bai 
2022 Spotlight: Leveraging the Hints: Adaptive Bidding in Repeated First-Price Auctions »
Wei Zhang · Yanjun Han · Zhengyuan Zhou · Aaron Flores · Tsachy Weissman 
2022 Spotlight: Lightning Talks 3B-1 »
Tianying Ji · Tongda Xu · Giulia Denevi · Aibek Alanov · Martin Wistuba · Wei Zhang · Yuesong Shen · Massimiliano Pontil · Vadim Titov · Yan Wang · Yu Luo · Daniel Cremers · Yanjun Han · Arlind Kadra · Dailan He · Josif Grabocka · Zhengyuan Zhou · Fuchun Sun · Carlo Ciliberto · Dmitry Vetrov · Mingxuan Jing · Chenjian Gao · Aaron Flores · Tsachy Weissman · Han Gao · Fengxiang He · Kunzan Liu · Wenbing Huang · Hongwei Qin 
2022 Poster: Towards Safe Reinforcement Learning with a Safety Editor Policy »
Haonan Yu · Wei Xu · Haichao Zhang 
2022 Poster: Leveraging the Hints: Adaptive Bidding in Repeated First-Price Auctions »
Wei Zhang · Yanjun Han · Zhengyuan Zhou · Aaron Flores · Tsachy Weissman 
2022 Poster: PaCo: Parameter-Compositional Multi-task Reinforcement Learning »
Lingfeng Sun · Haichao Zhang · Wei Xu · Masayoshi TOMIZUKA 
2021 Poster: TAAC: Temporally Abstract Actor-Critic for Continuous Control »
Haonan Yu · Wei Xu · Haichao Zhang 
2021 Poster: Online Multi-Armed Bandits with Adaptive Inference »
Maria Dimakopoulou · Zhimei Ren · Zhengyuan Zhou 
2020 : Live Q&A »
Yves Raimond · Cristina Segalin · Dong Liu · Selen Uguroglu · Benoit Rostykus · Harald Steck · Avneesh Saluja · Maria Dimakopoulou · Justin Basilico 
2020 : Slate Bandit Learning & Evaluation »
Maria Dimakopoulou 
2013 Poster: Simultaneous Rectification and Alignment via Robust Recovery of Low-rank Tensors »
Xiaoqin Zhang · Di Wang · Zhengyuan Zhou · Yi Ma