The Evolutionary Dynamics of Soft-Max Policy Gradient in Multi-Agent Settings
Martino Bernasconi-de-Luca · Federico Cacciamani · Simone Fioravanti · Nicola Gatti · Francesco Trovò

Policy gradient is one of the best-known algorithms in reinforcement learning. In this paper, we derive the mean dynamics of the soft-max policy gradient algorithm in multi-agent settings using tools from evolutionary game theory. Studying these dynamics is crucial to understanding the algorithm's weaknesses and to suggesting how to recover from them. For most multi-agent reinforcement learning algorithms, the mean dynamics are slight variants of the replicator dynamics that leave the properties of the original dynamics unaffected. The soft-max policy gradient dynamics, by contrast, have a different structure. They nevertheless remain closely connected to the replicator dynamics: they are a replicator dynamics applied to a non-linear transformation of the fitness function. We analyze the case of learning a best response separately from the cases of single- and multi-population games. In particular, we show that the soft-max policy gradient dynamics always converge to the best response. However, unlike the replicator dynamics, they always admit a non-empty set of bad initializations from which convergence to the best response is not monotonic. Furthermore, in single- and multi-population games, we show that the soft-max policy gradient dynamics satisfy a weaker set of properties than those satisfied by the replicator dynamics.
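
To make the connection with the replicator dynamics concrete, the following is a minimal sketch and not the paper's own formulation. It assumes the simplest best-response setting: a single learner with a fixed payoff vector r, a policy pi = softmax(theta), and gradient flow on the expected payoff pi · r. Under these assumptions the chain rule gives a replicator-style equation in which the fitness of action a is pi_a (r_a - pi · r) instead of r_a. The function names, the payoff vector, and the initial policy are all hypothetical choices made for illustration.

```python
def rd_velocity(pi, r):
    """Replicator dynamics: d(pi_a)/dt = pi_a * (r_a - average payoff)."""
    avg = sum(p * x for p, x in zip(pi, r))
    return [p * (x - avg) for p, x in zip(pi, r)]


def pg_velocity(pi, r):
    """Soft-max policy gradient flow, written as a replicator equation.

    Assumption: theta follows the exact gradient of pi . r, so the
    fitness of action a becomes f_a = pi_a * (r_a - average payoff).
    """
    avg = sum(p * x for p, x in zip(pi, r))
    # Non-linear transformation of the fitness: each advantage is
    # scaled by the probability of the action itself.
    f = [p * (x - avg) for p, x in zip(pi, r)]
    f_avg = sum(p * fa for p, fa in zip(pi, f))
    return [p * (fa - f_avg) for p, fa in zip(pi, f)]


def simulate(velocity, pi0, r, step=0.01, n_steps=1000):
    """Explicit Euler integration; returns the list of visited policies."""
    pi = list(pi0)
    trajectory = [pi]
    for _ in range(n_steps):
        v = velocity(pi, r)
        pi = [p + step * dv for p, dv in zip(pi, v)]
        trajectory.append(pi)
    return trajectory


# Hypothetical example: action 0 is the best response, but the initial
# policy puts most of its mass on the slightly worse action 1.
r = [1.0, 0.9, 0.0]
pi0 = [0.05, 0.9, 0.05]

print("replicator velocity     :", rd_velocity(pi0, r))
print("policy gradient velocity:", pg_velocity(pi0, r))
```

At this initial policy the first component of the replicator velocity is positive, while the first component of the policy gradient velocity is negative: the probability of the best response initially decreases, which is the kind of non-monotonic behavior the abstract attributes to bad initializations. Integrating for longer with `simulate` is one way to explore how the two trajectories evolve afterwards.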

Author Information

Martino Bernasconi-de-Luca (Politecnico di Milano)
Federico Cacciamani (Politecnico di Milano)
Simone Fioravanti (Gran Sasso Science Institute (GSSI))
Nicola Gatti (Politecnico di Milano)
Francesco Trovò (Politecnico di Milano)

