We study offline multi-agent reinforcement learning (RL), which aims to learn an optimal policy for multiple agents from historical data. Unlike in online RL, in the offline setting accidental overestimation errors arising from function approximation can accumulate and corrupt subsequent iterations. In this paper, we extend the pessimistic value iteration algorithm to the multi-agent setting: after obtaining a lower bound on the value function of each agent, we compute the joint policy by solving a general-sum matrix game with these lower bounds as the payoff matrices. Instead of finding a Nash equilibrium of this general-sum game, our algorithm solves for a coarse correlated equilibrium (CCE) for the sake of computational efficiency. Finally, we prove an information-theoretic lower bound on the regret of any algorithm learning from an offline dataset in the two-agent zero-sum game. To the best of our knowledge, such a CCE scheme for pessimism-based algorithms has not appeared in the literature and may be of independent interest. We hope our work sheds light on future analyses of the equilibria of a multi-agent Markov decision process given an offline dataset.
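To illustrate the CCE step described in the abstract: unlike a Nash equilibrium, a coarse correlated equilibrium of a two-player general-sum matrix game can be computed by linear programming, because the no-deviation constraints are linear in the joint action distribution. The sketch below is our own illustration, not code from the paper; the function name `solve_cce` and the welfare-maximizing objective are assumptions (any feasible point of the LP is a CCE), and in the paper's algorithm the payoff matrices would be the pessimistic value lower bounds.

```python
import numpy as np
from scipy.optimize import linprog


def solve_cce(u1, u2):
    """Find a coarse correlated equilibrium (CCE) of a two-player
    general-sum matrix game via linear programming.

    u1, u2: (m, n) payoff matrices for players 1 and 2.
    Returns an (m, n) joint distribution x over action pairs such that
    neither player gains by committing to a fixed action before the
    joint action is drawn.
    """
    m, n = u1.shape
    rows = []
    # Player 1's CCE constraints: for every deviation a',
    #   sum_{a,b} x[a,b] * (u1[a',b] - u1[a,b]) <= 0
    for a_dev in range(m):
        rows.append((u1[a_dev, :][None, :] - u1).reshape(-1))
    # Player 2's CCE constraints: for every deviation b',
    #   sum_{a,b} x[a,b] * (u2[a,b'] - u2[a,b]) <= 0
    for b_dev in range(n):
        rows.append((u2[:, b_dev][:, None] - u2).reshape(-1))
    A_ub = np.array(rows)
    b_ub = np.zeros(len(rows))
    # x must be a probability distribution over the m*n joint actions.
    A_eq = np.ones((1, m * n))
    b_eq = np.array([1.0])
    # Among all CCEs, maximize joint welfare (an arbitrary tie-breaker).
    obj = -(u1 + u2).reshape(-1)
    res = linprog(obj, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=[(0, None)] * (m * n), method="highs")
    assert res.success
    return res.x.reshape(m, n)
```

For example, in the coordination game with `u1 = u2 = [[2, 0], [0, 1]]`, the welfare-maximizing CCE puts all mass on the joint action `(0, 0)`.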
Yihang Chen (Peking University)
More from the Same Authors
2021: Pessimistic Offline Reinforcement Learning with Multiple Agents
2020 Poster: Sanity-Checking Pruning Methods: Random Tickets can Win the Jackpot
Jingtong Su · Yihang Chen · Tianle Cai · Tianhao Wu · Ruiqi Gao · Liwei Wang · Jason Lee