Timezone: »
In recent years, reinforcement learning systems with general goals beyond a cumulative sum of rewards have gained traction, such as in constrained problems, exploration, and acting upon prior experiences. In this paper, we consider policy optimization in Markov Decision Problems, where the objective is a general utility function of the state-action occupancy measure, which subsumes several of the aforementioned examples as special cases. Such generality invalidates the Bellman equation. As this means that dynamic programming no longer works, we focus on direct policy search. Analogously to the Policy Gradient Theorem \cite{sutton2000policy} available for RL with cumulative rewards, we derive a new Variational Policy Gradient Theorem for RL with general utilities, which establishes that the gradient may be obtained as the solution of a stochastic saddle point problem involving the Fenchel dual of the utility function. We develop a variational Monte Carlo gradient estimation algorithm to compute the policy gradient based on sample paths. Further, we prove that the variational policy gradient scheme converges globally to the optimal policy for the general objective, and we also establish its rate of convergence that matches or improves the convergence rate available in the case of RL with cumulative rewards.
Author Information
Junyu Zhang (Princeton University)
Alec Koppel (U.S. Army Research Laboratory)
Amrit Singh Bedi (US Army Research Laboratory)
Csaba Szepesvari (DeepMind / University of Alberta)
Mengdi Wang (Princeton University)
Mengdi Wang is interested in data-driven stochastic optimization and applications in machine and reinforcement learning. She received her PhD in Electrical Engineering and Computer Science from Massachusetts Institute of Technology in 2013. At MIT, Mengdi was affiliated with the Laboratory for Information and Decision Systems and was advised by Dimitri P. Bertsekas. Mengdi became an assistant professor at Princeton in 2014. She received the Young Researcher Prize in Continuous Optimization of the Mathematical Optimization Society in 2016 (awarded once every three years).
Related Events (a corresponding poster, oral, or spotlight)
-
2020 Poster: Variational Policy Gradient Method for Reinforcement Learning with General Utilities »
Tue. Dec 8th 05:00 -- 07:00 PM Room Poster Session 1 #204
More from the Same Authors
-
2021 Spotlight: On the Convergence and Sample Efficiency of Variance-Reduced Policy Gradient Method »
Junyu Zhang · Chengzhuo Ni · zheng Yu · Csaba Szepesvari · Mengdi Wang -
2022 : Provably Efficient Reinforcement Learning for Online Adaptive Influence Maximization »
Kaixuan Huang · Yu Wu · Xuezhou Zhang · Shenyinying Tu · Qingyun Wu · Mengdi Wang · Huazheng Wang -
2022 : Provably Efficient Reinforcement Learning for Online Adaptive Influence Maximization »
Kaixuan Huang · Yu Wu · Xuezhou Zhang · Shenyinying Tu · Qingyun Wu · Mengdi Wang · Huazheng Wang -
2022 : Pareto-Optimal Diagnostic Policy Learning in Clinical Applications via Semi-Model-Based Deep Reinforcement Learning »
zheng Yu · Yikuan Li · Joseph Kim · Kaixuan Huang · Yuan Luo · Mengdi Wang -
2022 : Pareto-Optimal Diagnostic Policy Learning in Clinical Applications via Semi-Model-Based Deep Reinforcement Learning »
zheng Yu · Yikuan Li · Joseph Kim · Kaixuan Huang · Yuan Luo · Mengdi Wang -
2022 : Provable Benefits of Representational Transfer in Reinforcement Learning »
Alekh Agarwal · Yuda Song · Kaiwen Wang · Mengdi Wang · Wen Sun · Xuezhou Zhang -
2022 : Posterior Coreset Construction with Kernelized Stein Discrepancy for Model-Based Reinforcement Learning »
Souradip Chakraborty · Amrit Bedi · Alec Koppel · Furong Huang · Pratap Tokekar · Dinesh Manocha -
2022 Poster: Communication Efficient Distributed Learning for Kernelized Contextual Bandits »
Chuanhao Li · Huazheng Wang · Mengdi Wang · Hongning Wang -
2022 Poster: The Role of Baselines in Policy Gradient Optimization »
Jincheng Mei · Wesley Chung · Valentin Thomas · Bo Dai · Csaba Szepesvari · Dale Schuurmans -
2022 Poster: Sample-Efficient Reinforcement Learning of Partially Observable Markov Games »
Qinghua Liu · Csaba Szepesvari · Chi Jin -
2022 Poster: Confident Approximate Policy Iteration for Efficient Local Planning in $q^\pi$-realizable MDPs »
Gellért Weisz · András György · Tadashi Kozuno · Csaba Szepesvari -
2022 Poster: Near-Optimal Sample Complexity Bounds for Constrained MDPs »
Sharan Vaswani · Lin Yang · Csaba Szepesvari -
2022 Poster: Decentralized Gossip-Based Stochastic Bilevel Optimization over Communication Networks »
Shuoguang Yang · Xuezhou Zhang · Mengdi Wang -
2022 Poster: Bandit Theory and Thompson Sampling-Guided Directed Evolution for Sequence Optimization »
Hui Yuan · Chengzhuo Ni · Huazheng Wang · Xuezhou Zhang · Le Cong · Csaba Szepesvari · Mengdi Wang -
2021 Poster: No Regrets for Learning the Prior in Bandits »
Soumya Basu · Branislav Kveton · Manzil Zaheer · Csaba Szepesvari -
2021 Poster: On the Convergence and Sample Efficiency of Variance-Reduced Policy Gradient Method »
Junyu Zhang · Chengzhuo Ni · zheng Yu · Csaba Szepesvari · Mengdi Wang -
2021 Poster: Understanding the Effect of Stochasticity in Policy Optimization »
Jincheng Mei · Bo Dai · Chenjun Xiao · Csaba Szepesvari · Dale Schuurmans -
2021 Poster: On the Role of Optimization in Double Descent: A Least Squares Study »
Ilja Kuzborskij · Csaba Szepesvari · Omar Rivasplata · Amal Rannen-Triki · Razvan Pascanu -
2020 Poster: Generalized Leverage Score Sampling for Neural Networks »
Jason Lee · Ruoqi Shen · Zhao Song · Mengdi Wang · zheng Yu -
2020 Poster: High-Dimensional Sparse Linear Bandits »
Botao Hao · Tor Lattimore · Mengdi Wang -
2020 Poster: Model Selection in Contextual Stochastic Bandit Problems »
Aldo Pacchiano · My Phan · Yasin Abbasi Yadkori · Anup Rao · Julian Zimmert · Tor Lattimore · Csaba Szepesvari -
2020 Poster: ImpatientCapsAndRuns: Approximately Optimal Algorithm Configuration from an Infinite Pool »
Gellert Weisz · András György · Wei-I Lin · Devon Graham · Kevin Leyton-Brown · Csaba Szepesvari · Brendan Lucier -
2020 Poster: Differentiable Meta-Learning of Bandit Policies »
Craig Boutilier · Chih-wei Hsu · Branislav Kveton · Martin Mladenov · Csaba Szepesvari · Manzil Zaheer -
2020 Poster: PAC-Bayes Analysis Beyond the Usual Bounds »
Omar Rivasplata · Ilja Kuzborskij · Csaba Szepesvari · John Shawe-Taylor -
2020 Poster: Escaping the Gravitational Pull of Softmax »
Jincheng Mei · Chenjun Xiao · Bo Dai · Lihong Li · Csaba Szepesvari · Dale Schuurmans -
2020 Poster: Online Algorithm for Unsupervised Sequential Selection with Contextual Information »
Arun Verma · Manjesh Kumar Hanawal · Csaba Szepesvari · Venkatesh Saligrama -
2020 Poster: Efficient Planning in Large MDPs with Weak Linear Function Approximation »
Roshan Shariff · Csaba Szepesvari -
2020 Oral: Escaping the Gravitational Pull of Softmax »
Jincheng Mei · Chenjun Xiao · Bo Dai · Lihong Li · Csaba Szepesvari · Dale Schuurmans -
2020 Poster: CoinDICE: Off-Policy Confidence Interval Estimation »
Bo Dai · Ofir Nachum · Yinlam Chow · Lihong Li · Csaba Szepesvari · Dale Schuurmans -
2020 Poster: On Function Approximation in Reinforcement Learning: Optimism in the Face of Large State Spaces »
Zhuoran Yang · Chi Jin · Zhaoran Wang · Mengdi Wang · Michael Jordan -
2020 Spotlight: CoinDICE: Off-Policy Confidence Interval Estimation »
Bo Dai · Ofir Nachum · Yinlam Chow · Lihong Li · Csaba Szepesvari · Dale Schuurmans -
2019 : Poster and Coffee Break 2 »
Karol Hausman · Kefan Dong · Ken Goldberg · Lihong Li · Lin Yang · Lingxiao Wang · Lior Shani · Liwei Wang · Loren Amdahl-Culleton · Lucas Cassano · Marc Dymetman · Marc Bellemare · Marcin Tomczak · Margarita Castro · Marius Kloft · Marius-Constantin Dinu · Markus Holzleitner · Martha White · Mengdi Wang · Michael Jordan · Mihailo Jovanovic · Ming Yu · Minshuo Chen · Moonkyung Ryu · Muhammad Zaheer · Naman Agarwal · Nan Jiang · Niao He · Nikolaus Yasui · Nikos Karampatziakis · Nino Vieillard · Ofir Nachum · Olivier Pietquin · Ozan Sener · Pan Xu · Parameswaran Kamalaruban · Paul Mineiro · Paul Rolland · Philip Amortila · Pierre-Luc Bacon · Prakash Panangaden · Qi Cai · Qiang Liu · Quanquan Gu · Raihan Seraj · Richard Sutton · Rick Valenzano · Robert Dadashi · Rodrigo Toro Icarte · Roshan Shariff · Roy Fox · Ruosong Wang · Saeed Ghadimi · Samuel Sokota · Sean Sinclair · Sepp Hochreiter · Sergey Levine · Sergio Valcarcel Macua · Sham Kakade · Shangtong Zhang · Sheila McIlraith · Shie Mannor · Shimon Whiteson · Shuai Li · Shuang Qiu · Wai Lok Li · Siddhartha Banerjee · Sitao Luan · Tamer Basar · Thinh Doan · Tianhe Yu · Tianyi Liu · Tom Zahavy · Toryn Klassen · Tuo Zhao · Vicenç Gómez · Vincent Liu · Volkan Cevher · Wesley Suttle · Xiao-Wen Chang · Xiaohan Wei · Xiaotong Liu · Xingguo Li · Xinyi Chen · Xingyou Song · Yao Liu · YiDing Jiang · Yihao Feng · Yilun Du · Yinlam Chow · Yinyu Ye · Yishay Mansour · · Yonathan Efroni · Yongxin Chen · Yuanhao Wang · Bo Dai · Chen-Yu Wei · Harsh Shrivastava · Hongyang Zhang · Qinqing Zheng · SIDDHARTHA SATPATHI · Xueqing Liu · Andreu Vall -
2019 : Poster Spotlight 2 »
Aaron Sidford · Mengdi Wang · Lin Yang · Yinyu Ye · Zuyue Fu · Zhuoran Yang · Yongxin Chen · Zhaoran Wang · Ofir Nachum · Bo Dai · Ilya Kostrikov · Dale Schuurmans · Ziyang Tang · Yihao Feng · Lihong Li · Denny Zhou · Qiang Liu · Rodrigo Toro Icarte · Ethan Waldie · Toryn Klassen · Rick Valenzano · Margarita Castro · Simon Du · Sham Kakade · Ruosong Wang · Minshuo Chen · Tianyi Liu · Xingguo Li · Zhaoran Wang · Tuo Zhao · Philip Amortila · Doina Precup · Prakash Panangaden · Marc Bellemare -
2019 : Poster and Coffee Break 1 »
Aaron Sidford · Aditya Mahajan · Alejandro Ribeiro · Alex Lewandowski · Ali H Sayed · Ambuj Tewari · Angelika Steger · Anima Anandkumar · Asier Mujika · Hilbert J Kappen · Bolei Zhou · Byron Boots · Chelsea Finn · Chen-Yu Wei · Chi Jin · Ching-An Cheng · Christina Yu · Clement Gehring · Craig Boutilier · Dahua Lin · Daniel McNamee · Daniel Russo · David Brandfonbrener · Denny Zhou · Devesh Jha · Diego Romeres · Doina Precup · Dominik Thalmeier · Eduard Gorbunov · Elad Hazan · Elena Smirnova · Elvis Dohmatob · Emma Brunskill · Enrique Munoz de Cote · Ethan Waldie · Florian Meier · Florian Schaefer · Ge Liu · Gergely Neu · Haim Kaplan · Hao Sun · Hengshuai Yao · Jalaj Bhandari · James A Preiss · Jayakumar Subramanian · Jiajin Li · Jieping Ye · Jimmy Smith · Joan Bas Serrano · Joan Bruna · John Langford · Jonathan Lee · Jose A. Arjona-Medina · Kaiqing Zhang · Karan Singh · Yuping Luo · Zafarali Ahmed · Zaiwei Chen · Zhaoran Wang · Zhizhong Li · Zhuoran Yang · Ziping Xu · Ziyang Tang · Yi Mao · David Brandfonbrener · Shirli Di-Castro · Riashat Islam · Zuyue Fu · Abhishek Naik · Saurabh Kumar · Benjamin Petit · Angeliki Kamoutsi · Simone Totaro · Arvind Raghunathan · Rui Wu · Donghwan Lee · Dongsheng Ding · Alec Koppel · Hao Sun · Christian Tjandraatmadja · Mahdi Karami · Jincheng Mei · Chenjun Xiao · Junfeng Wen · Zichen Zhang · Ross Goroshin · Mohammad Pezeshki · Jiaqi Zhai · Philip Amortila · Shuo Huang · Mariya Vasileva · El houcine Bergou · Adel Ahmadyan · Haoran Sun · Sheng Zhang · Lukas Gruber · Yuanhao Wang · Tetiana Parshakova -
2019 : Unsupervised State Embedding and Aggregation towards Scalable Reinforcement Learning »
Mengdi Wang -
2019 Poster: Think out of the "Box": Generically-Constrained Asynchronous Composite Optimization and Hedging »
Pooria Joulani · András György · Csaba Szepesvari -
2019 Poster: Detecting Overfitting via Adversarial Examples »
Roman Werpachowski · András György · Csaba Szepesvari -
2019 Poster: A Stochastic Composite Gradient Method with Incremental Variance Reduction »
Junyu Zhang · Lin Xiao -
2019 Poster: State Aggregation Learning from Markov Transition Data »
Yaqi Duan · Zheng Tracy Ke · Mengdi Wang -
2019 Poster: Learning low-dimensional state embeddings and metastable clusters from time series data »
Yifan Sun · Yaqi Duan · Hao Gong · Mengdi Wang -
2018 : Datasets and Benchmarks for Causal Learning »
Csaba Szepesvari · Isabelle Guyon · Nicolai Meinshausen · David Blei · Elias Bareinboim · Bernhard Schölkopf · Pietro Perona -
2018 : Model-free vs. Model-based Learning in a Causal World: Some Stories from Online Learning to Rank »
Csaba Szepesvari -
2018 Poster: Dimensionality Reduction for Stationary Time Series via Stochastic Nonconvex Optimization »
Minshuo Chen · Lin Yang · Mengdi Wang · Tuo Zhao -
2018 Poster: TopRank: A practical algorithm for online stochastic ranking »
Tor Lattimore · Branislav Kveton · Shuai Li · Csaba Szepesvari -
2018 Poster: Near-Optimal Time and Sample Complexities for Solving Markov Decision Processes with a Generative Model »
Aaron Sidford · Mengdi Wang · Xian Wu · Lin Yang · Yinyu Ye -
2018 Poster: PAC-Bayes bounds for stable algorithms with instance-dependent priors »
Omar Rivasplata · Emilio Parrado-Hernandez · John Shawe-Taylor · Shiliang Sun · Csaba Szepesvari -
2017 Poster: Multi-view Matrix Factorization for Linear Dynamical System Estimation »
Mahdi Karami · Martha White · Dale Schuurmans · Csaba Szepesvari -
2017 Poster: Diffusion Approximations for Online Principal Component Estimation and Global Convergence »
Chris Junchi Li · Mengdi Wang · Tong Zhang -
2017 Oral: Diffusion Approximations for Online Principal Component Estimation and Global Convergence »
Chris Junchi Li · Mengdi Wang · Tong Zhang -
2016 Poster: Accelerating Stochastic Composition Optimization »
Mengdi Wang · Ji Liu · Ethan Fang -
2016 Poster: Following the Leader and Fast Rates in Linear Prediction: Curved Constraint Sets and Other Regularities »
Ruitong Huang · Tor Lattimore · András György · Csaba Szepesvari -
2016 Poster: SDP Relaxation with Randomized Rounding for Energy Disaggregation »
Kiarash Shaloudegi · András György · Csaba Szepesvari · Wilsun Xu -
2016 Oral: SDP Relaxation with Randomized Rounding for Energy Disaggregation »
Kiarash Shaloudegi · András György · Csaba Szepesvari · Wilsun Xu -
2015 : Confidence intervals for the mixing time of a reversible Markov chain from a single sample path »
Csaba Szepesvari -
2015 Poster: Online Learning with Gaussian Payoffs and Side Observations »
Yifan Wu · András György · Csaba Szepesvari -
2015 Poster: Mixing Time Estimation in Reversible Markov Chains from a Single Sample Path »
Daniel Hsu · Aryeh Kontorovich · Csaba Szepesvari -
2015 Poster: Linear Multi-Resource Allocation with Semi-Bandit Feedback »
Tor Lattimore · Yacov Crammer · Csaba Szepesvari -
2015 Poster: Combinatorial Cascading Bandits »
Branislav Kveton · Zheng Wen · Azin Ashkan · Csaba Szepesvari -
2014 Workshop: Novel Trends and Applications in Reinforcement Learning »
Csaba Szepesvari · Marc Deisenroth · Sergey Levine · Pedro Ortega · Brian Ziebart · Emma Brunskill · Naftali Tishby · Gerhard Neumann · Daniel Lee · Sridhar Mahadevan · Pieter Abbeel · David Silver · Vicenç Gómez -
2014 Poster: Universal Option Models »
hengshuai yao · Csaba Szepesvari · Richard Sutton · Joseph Modayil · Shalabh Bhatnagar -
2013 Poster: Online Learning with Costly Features and Labels »
Navid Zolghadr · Gábor Bartók · Russell Greiner · András György · Csaba Szepesvari -
2013 Poster: Online Learning in Markov Decision Processes with Adversarially Chosen Transition Probability Distributions »
Yasin Abbasi Yadkori · Peter Bartlett · Varun Kanade · Yevgeny Seldin · Csaba Szepesvari -
2012 Session: Oral Session 6 »
Csaba Szepesvari -
2012 Poster: Deep Representations and Codes for Image Auto-Annotation »
Jamie Kiros · Csaba Szepesvari -
2011 Poster: Improved Algorithms for Linear Stochastic Bandits »
Yasin Abbasi Yadkori · David Pal · Csaba Szepesvari -
2011 Spotlight: Improved Algorithms for Linear Stochastic Bandits »
Yasin Abbasi Yadkori · David Pal · Csaba Szepesvari -
2010 Spotlight: Online Markov Decision Processes under Bandit Feedback »
Gergely Neu · András György · András Antos · Csaba Szepesvari -
2010 Poster: Online Markov Decision Processes under Bandit Feedback »
Gergely Neu · András György · Csaba Szepesvari · András Antos -
2010 Poster: Estimation of Renyi Entropy and Mutual Information Based on Generalized Nearest-Neighbor Graphs »
David Pal · Barnabas Poczos · Csaba Szepesvari -
2010 Poster: Parametric Bandits: The Generalized Linear Case »
Sarah Filippi · Olivier Cappé · Aurélien Garivier · Csaba Szepesvari -
2010 Poster: Error Propagation for Approximate Policy and Value Iteration »
Amir-massoud Farahmand · Remi Munos · Csaba Szepesvari -
2009 Poster: Multi-Step Dyna Planning for Policy Evaluation and Control »
Hengshuai Yao · Richard Sutton · Shalabh Bhatnagar · Dongcui Diao · Csaba Szepesvari -
2009 Poster: A General Projection Property for Distribution Families »
Yao-Liang Yu · Yuxi Li · Dale Schuurmans · Csaba Szepesvari -
2009 Poster: Convergent Temporal-Difference Learning with Arbitrary Smooth Function Approximation »
Hamid R Maei · Csaba Szepesvari · Shalabh Batnaghar · Doina Precup · David Silver · Richard Sutton -
2009 Spotlight: Convergent Temporal-Difference Learning with Arbitrary Smooth Function Approximation »
Hamid R Maei · Csaba Szepesvari · Shalabh Batnaghar · Doina Precup · David Silver · Richard Sutton -
2008 Poster: Online Optimization in X-Armed Bandits »
Sebastien Bubeck · Remi Munos · Gilles Stoltz · Csaba Szepesvari -
2008 Poster: Regularized Policy Iteration »
Amir-massoud Farahmand · Mohammad Ghavamzadeh · Csaba Szepesvari · Shie Mannor -
2008 Poster: A Convergent O(n) Temporal-difference Algorithm for Off-policy Learning with Linear Function Approxi »
Richard Sutton · Csaba Szepesvari · Hamid R Maei -
2007 Poster: Fitted Q-iteration in continuous action-space MDPs »
Remi Munos · András Antos · Csaba Szepesvari