Timezone: »
A widely-used actor-critic reinforcement learning algorithm for continuous control, Deep Deterministic Policy Gradients (DDPG), suffers from the overestimation problem, which can negatively affect the performance. Although the state-of-the-art Twin Delayed Deep Deterministic Policy Gradient (TD3) algorithm mitigates the overestimation issue, it can lead to a large underestimation bias. In this paper, we propose to use the Boltzmann softmax operator for value function estimation in continuous control. We first theoretically analyze the softmax operator in continuous action space. Then, we uncover an important property of the softmax operator in actor-critic algorithms, i.e., it helps to smooth the optimization landscape, which sheds new light on the benefits of the operator. We also design two new algorithms, Softmax Deep Deterministic Policy Gradients (SD2) and Softmax Deep Double Deterministic Policy Gradients (SD3), by building the softmax operator upon single and double estimators, which can effectively improve the overestimation and underestimation bias. We conduct extensive experiments on challenging continuous control tasks, and results show that SD3 outperforms state-of-the-art methods.
Author Information
Ling Pan (Tsinghua University)
Qingpeng Cai (Alibaba Group)
Longbo Huang (IIIS, Tsinghua Univeristy)
More from the Same Authors
-
2021 : Plan Better Amid Conservatism: Offline Multi-Agent Reinforcement Learning with Actor Rectification »
Ling Pan · Longbo Huang · Tengyu Ma · Huazhe Xu -
2022 Poster: Provable Generalization of Overparameterized Meta-learning Trained with SGD »
Yu Huang · Yingbin Liang · Longbo Huang -
2022 : Why (and When) does Local SGD Generalize Better than SGD? »
Xinran Gu · Kaifeng Lyu · Longbo Huang · Sanjeev Arora -
2022 : Online Min-max Optimization: Nonconvexity, Nonstationarity, and Dynamic Regret »
Yu Huang · Yuan Cheng · Yingbin Liang · Longbo Huang -
2022 Spotlight: Lightning Talks 3B-2 »
Yu Huang · Tero Karras · Maxim Kodryan · Shiau Hong Lim · Shudong Huang · Ziyu Wang · Siqiao Xue · ILYAS MALIK · Ekaterina Lobacheva · Miika Aittala · Hongjie Wu · Yuhao Zhou · Yingbin Liang · Xiaoming Shi · Jun Zhu · Maksim Nakhodnov · Timo Aila · Yazhou Ren · James Zhang · Longbo Huang · Dmitry Vetrov · Ivor Tsang · Hongyuan Mei · Samuli Laine · Zenglin Xu · Wentao Feng · Jiancheng Lv -
2022 Spotlight: Provable Generalization of Overparameterized Meta-learning Trained with SGD »
Yu Huang · Yingbin Liang · Longbo Huang -
2021 Poster: The best of both worlds: stochastic and adversarial episodic MDPs with unknown transition »
Tiancheng Jin · Longbo Huang · Haipeng Luo -
2021 Poster: Multi-Agent Reinforcement Learning in Stochastic Networked Systems »
Yiheng Lin · Guannan Qu · Longbo Huang · Adam Wierman -
2021 Poster: Continuous Mean-Covariance Bandits »
Yihan Du · Siwei Wang · Zhixuan Fang · Longbo Huang -
2021 Poster: Fast Federated Learning in the Presence of Arbitrary Device Unavailability »
Xinran Gu · Kaixuan Huang · Jingzhao Zhang · Longbo Huang -
2021 Poster: What Makes Multi-Modal Learning Better than Single (Provably) »
Yu Huang · Chenzhuang Du · Zihui Xue · Xuanyao Chen · Hang Zhao · Longbo Huang -
2021 Poster: Regularized Softmax Deep Multi-Agent Q-Learning »
Ling Pan · Tabish Rashid · Bei Peng · Longbo Huang · Shimon Whiteson -
2021 Oral: The best of both worlds: stochastic and adversarial episodic MDPs with unknown transition »
Tiancheng Jin · Longbo Huang · Haipeng Luo -
2020 Poster: Restless-UCB, an Efficient and Low-complexity Algorithm for Online Restless Bandits »
Siwei Wang · Longbo Huang · John C. S. Lui -
2019 Poster: Double Quantization for Communication-Efficient Distributed Optimization »
Yue Yu · Jiaxiang Wu · Longbo Huang -
2018 Poster: Multi-armed Bandits with Compensation »
Siwei Wang · Longbo Huang