Target Entropy Annealing for Discrete Soft Actor-Critic
Yaosheng Xu · Dailin Hu · Litian Liang · Stephen McAleer · Pieter Abbeel · Roy Fox

Soft Actor-Critic (SAC) is considered the state-of-the-art algorithm in continuous action space settings. It uses the maximum entropy framework for efficiency and stability, and applies a heuristic temperature Lagrange term to tune the temperature $\alpha$, which determines how "soft" the policy should be. Counter-intuitively, empirical evidence shows that SAC does not perform well in discrete domains. In this paper, we investigate possible explanations for this phenomenon and propose Target Entropy Scheduled SAC (TES-SAC), an annealing method for the target entropy parameter of SAC. The target entropy is a constant in the temperature Lagrange term and represents the target policy entropy in discrete SAC. We compare our method on Atari 2600 games against SAC with different constant target entropies, and analyze how our scheduling affects SAC.
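To make the role of the target entropy concrete, here is a minimal Python sketch of the standard SAC temperature objective for a single state with discrete actions, together with one possible annealing schedule. The abstract does not specify the schedule used by TES-SAC, so `linear_target_entropy` and its `start_frac` and `end_frac` parameters are hypothetical illustrations, not the authors' method.

```python
import math


def policy_entropy(probs):
    """Shannon entropy (in nats) of a discrete action distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def temperature_loss(log_alpha, probs, target_entropy):
    """Standard SAC temperature objective for one state, discrete actions.

    J(alpha) = alpha * (H(pi) - target_entropy). Minimizing it lowers alpha
    when the policy entropy is above the target and raises alpha otherwise.
    """
    alpha = math.exp(log_alpha)  # optimize log(alpha) so alpha stays positive
    return alpha * (policy_entropy(probs) - target_entropy)


def linear_target_entropy(step, total_steps, num_actions,
                          start_frac=0.98, end_frac=0.3):
    """Hypothetical schedule (an assumption, not taken from the paper):
    linearly anneal the target from start_frac to end_frac of the
    maximum entropy log(num_actions)."""
    max_entropy = math.log(num_actions)
    t = min(max(step / total_steps, 0.0), 1.0)  # training progress in [0, 1]
    frac = start_frac + t * (end_frac - start_frac)
    return frac * max_entropy


if __name__ == "__main__":
    probs = [0.7, 0.1, 0.1, 0.1]  # example policy over 4 actions
    for step in (0, 50, 100):
        target = linear_target_entropy(step, 100, num_actions=4)
        print(step, target, temperature_loss(0.0, probs, target))
```

With a constant target entropy, the second function alone describes how $\alpha$ is tuned; a schedule simply replaces the constant with a value that depends on the training step.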

#### Author Information

##### Pieter Abbeel (UC Berkeley & Covariant)

Pieter Abbeel is Professor and Director of the Robot Learning Lab at UC Berkeley [2008- ], Co-Director of the Berkeley AI Research (BAIR) Lab, Co-Founder of covariant.ai [2017- ], Co-Founder of Gradescope [2014- ], Advisor to OpenAI, Founding Faculty Partner at the AI@TheHouse venture fund, and Advisor to many AI/Robotics start-ups. He works in machine learning and robotics. In particular, his research focuses on how to make robots learn from people (apprenticeship learning), how to make robots learn through their own trial and error (reinforcement learning), and how to speed up skill acquisition through learning-to-learn (meta-learning). His robots have learned advanced helicopter aerobatics, knot-tying, basic assembly, organizing laundry, locomotion, and vision-based robotic manipulation. He has won numerous awards, including best paper awards at ICML, NIPS and ICRA, early career awards from NSF, DARPA, ONR, AFOSR, Sloan, TR35, IEEE, and the Presidential Early Career Award for Scientists and Engineers (PECASE). Pieter's work is frequently featured in the popular press, including the New York Times, BBC, Bloomberg, Wall Street Journal, Wired, Forbes, Tech Review, and NPR.

##### Roy Fox (UC Irvine)

[Roy Fox](http://roydfox.com/) is a postdoc at UC Berkeley working with [Ion Stoica](http://people.eecs.berkeley.edu/~istoica/) in the Real-Time Intelligent Secure Explainable lab ([RISELab](https://rise.cs.berkeley.edu/)), and with [Ken Goldberg](http://goldberg.berkeley.edu/) in the Laboratory for Automation Science and Engineering ([AUTOLAB](http://autolab.berkeley.edu/)). His research interests include reinforcement learning, dynamical systems, information theory, automation, and the connections between these fields. His current research focuses on automatic discovery of hierarchical control structures in deep reinforcement learning and in imitation learning of robotic tasks. Roy holds an MSc in Computer Science from the [Technion](http://www.cs.technion.ac.il/), under the supervision of [Moshe Tennenholtz](http://iew3.technion.ac.il/Home/Users/Moshet.phtml), and a PhD in Computer Science from the [Hebrew University](http://www.cs.huji.ac.il/), under the supervision of [Naftali Tishby](http://www.cs.huji.ac.il/~tishby/). He was an exchange PhD student with [Larry Abbott](http://www.cs.huji.ac.il/~tishby/) and [Liam Paninski](http://www.stat.columbia.edu/~liam/) at the [Center for Theoretical Neuroscience](http://www.neurotheory.columbia.edu/) at Columbia University, and a research intern at Microsoft Research.