Poster
Policy Optimization in Adversarial MDPs: Improved Exploration via Dilated Bonuses
Haipeng Luo · Chen-Yu Wei · Chung-Wei Lee

Tue Dec 07 04:30 PM -- 06:00 PM (PST)
Policy optimization is a widely used method in reinforcement learning. Due to its local-search nature, however, theoretical guarantees on global optimality often rely on extra assumptions on the Markov Decision Processes (MDPs) that bypass the challenge of global exploration. To eliminate the need for such assumptions, in this work we develop a general solution that adds dilated bonuses to the policy update to facilitate global exploration. To showcase the power and generality of this technique, we apply it to several episodic MDP settings with adversarial losses and bandit feedback, improving and generalizing the state of the art. Specifically, in the tabular case, we obtain $\widetilde{\mathcal{O}}(\sqrt{T})$ regret where $T$ is the number of episodes, improving on the $\widetilde{\mathcal{O}}({T}^{\frac{2}{3}})$ regret bound of Shani et al. [2020]. When the number of states is infinite, under the assumption that the state-action values are linear in some low-dimensional features, we obtain $\widetilde{\mathcal{O}}({T}^{\frac{2}{3}})$ regret with the help of a simulator, matching the result of Neu and Olkhovskaya [2020] while importantly removing the need for an exploratory policy that their algorithm requires. To our knowledge, this is the first algorithm with sublinear regret for linear function approximation with adversarial losses, bandit feedback, and no exploratory assumptions. Finally, we also discuss how to further improve the regret or remove the need for a simulator using dilated bonuses, when an exploratory policy is available.
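To make the core idea concrete, here is a minimal toy sketch of a bonus-augmented policy update in the tabular setting: a per-state exponential-weights (mirror-descent) step on estimated losses, where an exploration bonus is subtracted from the loss estimate so that under-explored actions gain probability. This is an illustrative stand-in only, not the paper's actual algorithm; in particular it omits the "dilation" of bonuses across the horizon, and all names and shapes (`q_hat`, `bonus`, `lr`) are assumptions made for the example.

```python
import numpy as np

def bonus_augmented_update(pi, q_hat, bonus, lr=0.1):
    """One exponential-weights policy update per state.

    pi, q_hat, bonus: arrays of shape (num_states, num_actions).
    q_hat is an estimated loss-to-go; subtracting the bonus makes
    the update behave as if under-explored actions had lower loss,
    which pushes probability mass toward them.
    """
    # Work in log space for numerical stability.
    logits = np.log(pi) - lr * (q_hat - bonus)
    logits -= logits.max(axis=1, keepdims=True)
    new_pi = np.exp(logits)
    new_pi /= new_pi.sum(axis=1, keepdims=True)  # renormalize each state's row
    return new_pi
```

With zero bonuses this reduces to a plain exponential-weights step on `q_hat`; raising the bonus of one action strictly increases its probability at the next iterate, which is the exploration effect the paper's dilated bonuses are designed to propagate through the episode.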

Author Information

Haipeng Luo (University of Southern California)
Chen-Yu Wei (University of Southern California)
Chung-Wei Lee (University of Southern California)

More from the Same Authors

  • 2021 Poster: The best of both worlds: stochastic and adversarial episodic MDPs with unknown transition »
    Tiancheng Jin · Longbo Huang · Haipeng Luo
  • 2021 Poster: Last-iterate Convergence in Extensive-Form Games »
    Chung-Wei Lee · Christian Kroer · Haipeng Luo
  • 2021 Poster: Implicit Finite-Horizon Approximation and Efficient Optimal Algorithms for Stochastic Shortest Path »
    Liyu Chen · Mehdi Jafarnia-Jahromi · Rahul Jain · Haipeng Luo
  • 2021 Oral: The best of both worlds: stochastic and adversarial episodic MDPs with unknown transition »
    Tiancheng Jin · Longbo Huang · Haipeng Luo
  • 2020 Poster: Bias no more: high-probability data-dependent regret bounds for adversarial bandits and MDPs »
    Chung-Wei Lee · Haipeng Luo · Chen-Yu Wei · Mengxiao Zhang
  • 2020 Poster: Simultaneously Learning Stochastic and Adversarial Episodic MDPs with Known Transition »
    Tiancheng Jin · Haipeng Luo
  • 2020 Spotlight: Simultaneously Learning Stochastic and Adversarial Episodic MDPs with Known Transition »
    Tiancheng Jin · Haipeng Luo
  • 2020 Oral: Bias no more: high-probability data-dependent regret bounds for adversarial bandits and MDPs »
    Chung-Wei Lee · Haipeng Luo · Chen-Yu Wei · Mengxiao Zhang
  • 2020 Poster: Comparator-Adaptive Convex Bandits »
    Dirk van der Hoeven · Ashok Cutkosky · Haipeng Luo
  • 2019 : Poster and Coffee Break 2 »
    Chen-Yu Wei · (with many other workshop presenters)
  • 2019 : Poster and Coffee Break 1 »
    Chen-Yu Wei · (with many other workshop presenters)
  • 2019 Poster: Equipping Experts/Bandits with Long-term Memory »
    Kai Zheng · Haipeng Luo · Ilias Diakonikolas · Liwei Wang
  • 2019 Poster: Model Selection for Contextual Bandits »
    Dylan Foster · Akshay Krishnamurthy · Haipeng Luo
  • 2019 Spotlight: Model Selection for Contextual Bandits »
    Dylan Foster · Akshay Krishnamurthy · Haipeng Luo
  • 2019 Poster: Hypothesis Set Stability and Generalization »
    Dylan Foster · Spencer Greenberg · Satyen Kale · Haipeng Luo · Mehryar Mohri · Karthik Sridharan
  • 2018 Poster: Efficient Online Portfolio with Logarithmic Regret »
    Haipeng Luo · Chen-Yu Wei · Kai Zheng
  • 2018 Spotlight: Efficient Online Portfolio with Logarithmic Regret »
    Haipeng Luo · Chen-Yu Wei · Kai Zheng