Timezone: »
Despite its promise, reinforcement learning’s real-world adoption has been hampered by the need for costly exploration to learn a good policy. Imitation learning (IL) mitigates this shortcoming by using an oracle policy during training as a bootstrap to accelerate the learning process. However, in many practical situations, the learner has access to multiple suboptimal oracles, which may provide conflicting advice in a state. The existing IL literature provides a limited treatment of such scenarios. Whereas in the single-oracle case, the return of the oracle’s policy provides an obvious benchmark for the learner to compete against, neither such a benchmark nor principled ways of outperforming it are known for the multi-oracle setting. In this paper, we propose the state-wise maximum of the oracle policies’ values as a natural baseline to resolve conflicting advice from multiple oracles. Using a reduction of policy optimization to online learning, we introduce a novel IL algorithm MAMBA, which can provably learn a policy competitive with this benchmark. In particular, MAMBA optimizes policies by using a gradient estimator in the style of generalized advantage estimation (GAE). Our theoretical analysis shows that this design makes MAMBA robust and enables it to outperform the oracle policies by a larger margin than the IL state of the art, even in the single-oracle case. In an evaluation against standard policy gradient with GAE and AggreVaTe(D), we showcase MAMBA’s ability to leverage demonstrations both from a single and from multiple weak oracles, and significantly speed up policy optimization.
Author Information
Ching-An Cheng (Microsoft Research)
Andrey Kolobov (Microsoft Research)
Alekh Agarwal (Microsoft Research)
Related Events (a corresponding poster, oral, or spotlight)
-
2020 Spotlight: Policy Improvement via Imitation of Multiple Oracles »
Wed. Dec 9th 03:50 -- 04:00 AM Room Orals & Spotlights: Reinforcement Learning
More from the Same Authors
-
2022 : AMORE: A Model-based Framework for Improving Arbitrary Baseline Policies with Offline Data »
Tengyang Xie · Mohak Bhardwaj · Nan Jiang · Ching-An Cheng -
2022 : Provable Benefits of Representational Transfer in Reinforcement Learning »
Alekh Agarwal · Yuda Song · Kaiwen Wang · Mengdi Wang · Wen Sun · Xuezhou Zhang -
2022 Poster: MoCapAct: A Multi-Task Dataset for Simulated Humanoid Control »
Nolan Wagener · Andrey Kolobov · Felipe Vieira Frujeri · Ricky Loynd · Ching-An Cheng · Matthew Hausknecht -
2022 Poster: On the Statistical Efficiency of Reward-Free Exploration in Non-Linear RL »
Jinglin Chen · Aditya Modi · Akshay Krishnamurthy · Nan Jiang · Alekh Agarwal -
2022 Poster: Model-based RL with Optimistic Posterior Sampling: Structural Conditions and Sample Complexity »
Alekh Agarwal · Tong Zhang -
2021 Poster: Bellman-consistent Pessimism for Offline Reinforcement Learning »
Tengyang Xie · Ching-An Cheng · Nan Jiang · Paul Mineiro · Alekh Agarwal -
2021 Poster: Heuristic-Guided Reinforcement Learning »
Ching-An Cheng · Andrey Kolobov · Adith Swaminathan -
2021 Oral: Bellman-consistent Pessimism for Offline Reinforcement Learning »
Tengyang Xie · Ching-An Cheng · Nan Jiang · Paul Mineiro · Alekh Agarwal -
2020 Poster: Intra Order-preserving Functions for Calibration of Multi-Class Neural Networks »
Amir Rahimi · Amirreza Shaban · Ching-An Cheng · Richard I Hartley · Byron Boots -
2020 Poster: FLAMBE: Structural Complexity and Representation Learning of Low Rank MDPs »
Alekh Agarwal · Sham Kakade · Akshay Krishnamurthy · Wen Sun -
2020 Poster: PC-PG: Policy Cover Directed Exploration for Provable Policy Gradient Learning »
Alekh Agarwal · Mikael Henaff · Sham Kakade · Wen Sun -
2020 Oral: FLAMBE: Structural Complexity and Representation Learning of Low Rank MDPs »
Alekh Agarwal · Sham Kakade · Akshay Krishnamurthy · Wen Sun -
2020 Poster: Safe Reinforcement Learning via Curriculum Induction »
Matteo Turchetta · Andrey Kolobov · Shital Shah · Andreas Krause · Alekh Agarwal -
2020 Poster: Provably Good Batch Reinforcement Learning Without Great Exploration »
Yao Liu · Adith Swaminathan · Alekh Agarwal · Emma Brunskill -
2020 Spotlight: Safe Reinforcement Learning via Curriculum Induction »
Matteo Turchetta · Andrey Kolobov · Shital Shah · Andreas Krause · Alekh Agarwal -
2019 : Continuous Online Learning and New Insights to Online Imitation Learning »
Jonathan Lee · Ching-An Cheng · Ken Goldberg · Byron Boots -
2019 : Poster and Coffee Break 1 »
Aaron Sidford · Aditya Mahajan · Alejandro Ribeiro · Alex Lewandowski · Ali H Sayed · Ambuj Tewari · Angelika Steger · Anima Anandkumar · Asier Mujika · Hilbert J Kappen · Bolei Zhou · Byron Boots · Chelsea Finn · Chen-Yu Wei · Chi Jin · Ching-An Cheng · Christina Yu · Clement Gehring · Craig Boutilier · Dahua Lin · Daniel McNamee · Daniel Russo · David Brandfonbrener · Denny Zhou · Devesh Jha · Diego Romeres · Doina Precup · Dominik Thalmeier · Eduard Gorbunov · Elad Hazan · Elena Smirnova · Elvis Dohmatob · Emma Brunskill · Enrique Munoz de Cote · Ethan Waldie · Florian Meier · Florian Schaefer · Ge Liu · Gergely Neu · Haim Kaplan · Hao Sun · Hengshuai Yao · Jalaj Bhandari · James A Preiss · Jayakumar Subramanian · Jiajin Li · Jieping Ye · Jimmy Smith · Joan Bas Serrano · Joan Bruna · John Langford · Jonathan Lee · Jose A. Arjona-Medina · Kaiqing Zhang · Karan Singh · Yuping Luo · Zafarali Ahmed · Zaiwei Chen · Zhaoran Wang · Zhizhong Li · Zhuoran Yang · Ziping Xu · Ziyang Tang · Yi Mao · David Brandfonbrener · Shirli Di-Castro · Riashat Islam · Zuyue Fu · Abhishek Naik · Saurabh Kumar · Benjamin Petit · Angeliki Kamoutsi · Simone Totaro · Arvind Raghunathan · Rui Wu · Donghwan Lee · Dongsheng Ding · Alec Koppel · Hao Sun · Christian Tjandraatmadja · Mahdi Karami · Jincheng Mei · Chenjun Xiao · Junfeng Wen · Zichen Zhang · Ross Goroshin · Mohammad Pezeshki · Jiaqi Zhai · Philip Amortila · Shuo Huang · Mariya Vasileva · El houcine Bergou · Adel Ahmadyan · Haoran Sun · Sheng Zhang · Lukas Gruber · Yuanhao Wang · Tetiana Parshakova -
2019 Poster: Bias Correction of Learned Generative Models using Likelihood-Free Importance Weighting »
Aditya Grover · Jiaming Song · Ashish Kapoor · Kenneth Tran · Alekh Agarwal · Eric Horvitz · Stefano Ermon -
2019 Poster: Staying up to Date with Online Content Changes Using Reinforcement Learning for Scheduling »
Andrey Kolobov · Yuval Peres · Cheng Lu · Eric Horvitz -
2018 Poster: On Oracle-Efficient PAC RL with Rich Observations »
Christoph Dann · Nan Jiang · Akshay Krishnamurthy · Alekh Agarwal · John Langford · Robert Schapire -
2018 Poster: Orthogonally Decoupled Variational Gaussian Processes »
Hugh Salimbeni · Ching-An Cheng · Byron Boots · Marc Deisenroth -
2018 Spotlight: On Oracle-Efficient PAC RL with Rich Observations »
Christoph Dann · Nan Jiang · Akshay Krishnamurthy · Alekh Agarwal · John Langford · Robert Schapire -
2017 Workshop: OPT 2017: Optimization for Machine Learning »
Suvrit Sra · Sashank J. Reddi · Alekh Agarwal · Benjamin Recht -
2017 Poster: Off-policy evaluation for slate recommendation »
Adith Swaminathan · Akshay Krishnamurthy · Alekh Agarwal · Miro Dudik · John Langford · Damien Jose · Imed Zitouni -
2017 Oral: Off-policy evaluation for slate recommendation »
Adith Swaminathan · Akshay Krishnamurthy · Alekh Agarwal · Miro Dudik · John Langford · Damien Jose · Imed Zitouni -
2017 Poster: Variational Inference for Gaussian Process Models with Linear Complexity »
Ching-An Cheng · Byron Boots -
2016 Demonstration: Project Malmo - Minecraft for AI Research »
Katja Hofmann · Matthew A Johnson · Fernando Diaz · Alekh Agarwal · Tim Hutton · David Bignell · Evelyne Viegas -
2016 Poster: Incremental Variational Sparse Gaussian Process Regression »
Ching-An Cheng · Byron Boots -
2016 Poster: Efficient Second Order Online Learning by Sketching »
Haipeng Luo · Alekh Agarwal · Nicolò Cesa-Bianchi · John Langford -
2016 Poster: Contextual semibandits via supervised learning oracles »
Akshay Krishnamurthy · Alekh Agarwal · Miro Dudik -
2016 Poster: PAC Reinforcement Learning with Rich Observations »
Akshay Krishnamurthy · Alekh Agarwal · John Langford -
2015 Workshop: Optimization for Machine Learning (OPT2015) »
Suvrit Sra · Alekh Agarwal · Leon Bottou · Sashank J. Reddi -
2015 Poster: Efficient and Parsimonious Agnostic Active Learning »
Tzu-Kuo Huang · Alekh Agarwal · Daniel Hsu · John Langford · Robert Schapire -
2015 Spotlight: Efficient and Parsimonious Agnostic Active Learning »
Tzu-Kuo Huang · Alekh Agarwal · Daniel Hsu · John Langford · Robert Schapire -
2015 Poster: Fast Convergence of Regularized Learning in Games »
Vasilis Syrgkanis · Alekh Agarwal · Haipeng Luo · Robert Schapire -
2015 Oral: Fast Convergence of Regularized Learning in Games »
Vasilis Syrgkanis · Alekh Agarwal · Haipeng Luo · Robert Schapire -
2014 Workshop: OPT2014: Optimization for Machine Learning »
Zaid Harchaoui · Suvrit Sra · Alekh Agarwal · Martin Jaggi · Miro Dudik · Aaditya Ramdas · Jean Lasserre · Yoshua Bengio · Amir Beck -
2014 Poster: Scalable Non-linear Learning with Adaptive Polynomial Expansions »
Alekh Agarwal · Alina Beygelzimer · Daniel Hsu · John Langford · Matus J Telgarsky -
2013 Workshop: Learning Faster From Easy Data »
Peter Grünwald · Wouter M Koolen · Sasha Rakhlin · Nati Srebro · Alekh Agarwal · Karthik Sridharan · Tim van Erven · Sebastien Bubeck -
2013 Workshop: OPT2013: Optimization for Machine Learning »
Suvrit Sra · Alekh Agarwal -
2012 Workshop: Optimization for Machine Learning »
Suvrit Sra · Alekh Agarwal -
2012 Poster: Stochastic optimization and sparse statistical recovery: Optimal algorithms for high dimensions »
Alekh Agarwal · Sahand N Negahban · Martin J Wainwright -
2011 Workshop: Computational Trade-offs in Statistical Learning »
Alekh Agarwal · Sasha Rakhlin -
2011 Poster: Distributed Delayed Stochastic Optimization »
Alekh Agarwal · John Duchi -
2011 Poster: Stochastic convex optimization with bandit feedback »
Alekh Agarwal · Dean P Foster · Daniel Hsu · Sham M Kakade · Sasha Rakhlin -
2010 Workshop: Learning on Cores, Clusters, and Clouds »
Alekh Agarwal · Lawrence Cayton · Ofer Dekel · John Duchi · John Langford -
2010 Spotlight: Distributed Dual Averaging In Networks »
John Duchi · Alekh Agarwal · Martin J Wainwright -
2010 Poster: Distributed Dual Averaging In Networks »
John Duchi · Alekh Agarwal · Martin J Wainwright -
2010 Oral: Fast global convergence rates of gradient methods for high-dimensional statistical recovery »
Alekh Agarwal · Sahand N Negahban · Martin J Wainwright -
2010 Poster: Fast global convergence rates of gradient methods for high-dimensional statistical recovery »
Alekh Agarwal · Sahand N Negahban · Martin J Wainwright -
2009 Poster: Information-theoretic lower bounds on the oracle complexity of convex optimization »
Alekh Agarwal · Peter Bartlett · Pradeep Ravikumar · Martin J Wainwright -
2009 Spotlight: Information-theoretic lower bounds on the oracle complexity of convex optimization »
Alekh Agarwal · Peter Bartlett · Pradeep Ravikumar · Martin J Wainwright -
2007 Poster: An Analysis of Inference with the Universum »
Fabian H Sinz · Olivier Chapelle · Alekh Agarwal · Bernhard Schölkopf -
2007 Spotlight: An Analysis of Inference with the Universum »
Fabian H Sinz · Olivier Chapelle · Alekh Agarwal · Bernhard Schölkopf