Poster
Society of Agents: Regret Bounds of Concurrent Thompson Sampling
Yan Chen · Perry Dong · Qinxun Bai · Maria Dimakopoulou · Wei Xu · Zhengyuan Zhou
We consider the concurrent reinforcement learning problem where $n$ agents simultaneously learn to make decisions in the same environment by sharing experience with each other. Existing works in this emerging area have empirically demonstrated that Thompson sampling (TS) based algorithms provide a particularly attractive alternative for inducing cooperation, because each agent can independently sample a belief environment (and compute a corresponding optimal policy) from the joint posterior computed by aggregating all agents' data, which induces diversity in exploration among agents while benefiting from the shared experience of all agents. However, theoretical guarantees in this area remain under-explored; in particular, no regret bound is known for TS-based concurrent RL algorithms. In this paper, we fill this gap by considering two settings. In the first, we study the simple finite-horizon episodic RL setting, where TS is naturally adapted into the concurrent setup by having each agent sample from the current joint posterior at the beginning of each episode. We establish a $\tilde{O}(HS\sqrt{\frac{AT}{n}})$ per-agent regret bound, where $H$ is the horizon of the episode, $S$ is the number of states, $A$ is the number of actions, $T$ is the number of episodes and $n$ is the number of agents. In the second setting, we consider the infinite-horizon RL problem, where a policy is measured by its long-run average reward. Here, despite not having natural episodic breakpoints, we show that with a doubling-horizon schedule, we can adapt TS to the infinite-horizon concurrent learning setting to achieve a regret bound of $\tilde{O}(DS\sqrt{ATn})$, where $D$ is the standard notion of diameter of the underlying MDP and $T$ is the number of timesteps. Note that in both settings, the per-agent regret decreases at an optimal rate of $\Theta(\frac{1}{\sqrt{n}})$, which demonstrates the power of cooperation in concurrent RL.
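The episodic scheme described in the abstract (each agent samples its own environment from the shared posterior at the start of an episode, plans for it, and then all data are pooled) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes a small tabular MDP with known rewards, a Dirichlet(1, ..., 1) prior over next-state distributions, a fixed start state 0, and the Python standard library only; the names `sample_dirichlet`, `plan`, and `concurrent_ts` are hypothetical.

```python
import random

def sample_dirichlet(counts, rng):
    # Draw from Dirichlet(counts) by normalizing independent Gamma draws.
    draws = [rng.gammavariate(c, 1.0) for c in counts]
    total = sum(draws)
    return [d / total for d in draws]

def plan(P, R, H, S, A):
    # Finite-horizon backward induction; returns policy[h][s] -> action.
    V = [0.0] * S
    policy = []
    for _ in range(H):
        Q = [[R[s][a] + sum(P[s][a][s2] * V[s2] for s2 in range(S))
              for a in range(A)] for s in range(S)]
        policy.append([max(range(A), key=lambda a: Q[s][a]) for s in range(S)])
        V = [max(Q[s]) for s in range(S)]
    policy.reverse()  # computed from the last step backwards
    return policy

def concurrent_ts(true_P, R, H, n_agents, n_episodes, seed=0):
    # Assumption: rewards R[s][a] are known; only transitions are learned.
    rng = random.Random(seed)
    S, A = len(R), len(R[0])
    # Joint posterior: Dirichlet counts pooled over all agents' transitions.
    counts = [[[1.0] * S for _ in range(A)] for _ in range(S)]
    returns = []
    for _ in range(n_episodes):
        batch = []           # transitions collected by all agents this episode
        episode_total = 0.0
        for _ in range(n_agents):
            # Each agent independently samples an MDP from the same posterior.
            P = [[sample_dirichlet(counts[s][a], rng) for a in range(A)]
                 for s in range(S)]
            policy = plan(P, R, H, S, A)
            s = 0            # assumed fixed start state
            for h in range(H):
                a = policy[h][s]
                s2 = rng.choices(range(S), weights=true_P[s][a])[0]
                episode_total += R[s][a]
                batch.append((s, a, s2))
                s = s2
        # Aggregate every agent's data into the shared posterior.
        for s, a, s2 in batch:
            counts[s][a][s2] += 1.0
        returns.append(episode_total / n_agents)
    return counts, returns
```

Because the posterior is updated only after all agents finish the episode, the agents act on independent samples from the same belief, which is the source of the exploration diversity mentioned above. The sketch does not cover the infinite-horizon, doubling-horizon variant.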
Author Information
Yan Chen (Duke University)
Perry Dong (University of California, Berkeley)
Qinxun Bai (Horizon Robotics)
Maria Dimakopoulou (Netflix)
Wei Xu (Horizon Robotics)
Zhengyuan Zhou (Arena Technologies & NYU)
More from the Same Authors
- 2022 : Offline Reinforcement Learning with Closed-Form Policy Improvement Operators
  Jiachen Li · Edwin Zhang · Ming Yin · Qinxun Bai · Yu-Xiang Wang · William Yang Wang
- 2022 : Off-policy Reinforcement Learning with Optimistic Exploration and Distribution Correction
  Jiachen Li · Shuo Cheng · Zhenyu Liao · Huayan Wang · William Yang Wang · Qinxun Bai
- 2022 Spotlight: Leveraging the Hints: Adaptive Bidding in Repeated First-Price Auctions
  Wei Zhang · Yanjun Han · Zhengyuan Zhou · Aaron Flores · Tsachy Weissman
- 2022 Spotlight: Lightning Talks 3B-1
  Tianying Ji · Tongda Xu · Giulia Denevi · Aibek Alanov · Martin Wistuba · Wei Zhang · Yuesong Shen · Massimiliano Pontil · Vadim Titov · Yan Wang · Yu Luo · Daniel Cremers · Yanjun Han · Arlind Kadra · Dailan He · Josif Grabocka · Zhengyuan Zhou · Fuchun Sun · Carlo Ciliberto · Dmitry Vetrov · Mingxuan Jing · Chenjian Gao · Aaron Flores · Tsachy Weissman · Han Gao · Fengxiang He · Kunzan Liu · Wenbing Huang · Hongwei Qin
- 2022 Poster: Towards Safe Reinforcement Learning with a Safety Editor Policy
  Haonan Yu · Wei Xu · Haichao Zhang
- 2022 Poster: Leveraging the Hints: Adaptive Bidding in Repeated First-Price Auctions
  Wei Zhang · Yanjun Han · Zhengyuan Zhou · Aaron Flores · Tsachy Weissman
- 2022 Poster: PaCo: Parameter-Compositional Multi-task Reinforcement Learning
  Lingfeng Sun · Haichao Zhang · Wei Xu · Masayoshi TOMIZUKA
- 2021 Poster: TAAC: Temporally Abstract Actor-Critic for Continuous Control
  Haonan Yu · Wei Xu · Haichao Zhang
- 2021 Poster: Online Multi-Armed Bandits with Adaptive Inference
  Maria Dimakopoulou · Zhimei Ren · Zhengyuan Zhou
- 2020 : Live Q&A
  Yves Raimond · Cristina Segalin · Dong Liu · Selen Uguroglu · Benoit Rostykus · Harald Steck · Avneesh Saluja · Maria Dimakopoulou · Justin Basilico
- 2020 : Slate Bandit Learning & Evaluation
  Maria Dimakopoulou
- 2013 Poster: Simultaneous Rectification and Alignment via Robust Recovery of Low-rank Tensors
  Xiaoqin Zhang · Di Wang · Zhengyuan Zhou · Yi Ma