Skip to yearly menu bar Skip to main content


Conservative Offline Policy Adaptation in Multi-Agent Games

Chengjie Wu · Pingzhong Tang · Jun Yang · Yujing Hu · Tangjie Lv · Changjie Fan · Chongjie Zhang

Great Hall & Hall B1+B2 (level 1) #1502
[ ]
[ Paper [ Slides [ Poster [ OpenReview
Tue 12 Dec 8:45 a.m. PST — 10:45 a.m. PST


Prior research on policy adaptation in multi-agent games has often relied on online interaction with the target agent in training, which can be expensive and impractical in real-world scenarios. Inspired by recent progress in offline reinforcement learn- ing, this paper studies offline policy adaptation, which aims to utilize the target agent’s behavior data to exploit its weakness or enable effective cooperation. We investigate its distinct challenges of distributional shift and risk-free deviation, and propose a novel learning objective, conservative offline adaptation, that optimizes the worst-case performance against any dataset consistent proxy models. We pro- pose an efficient algorithm called Constrained Self-Play (CSP) that incorporates dataset information into regularized policy learning. We prove that CSP learns a near-optimal risk-free offline adaptation policy upon convergence. Empirical results demonstrate that CSP outperforms non-conservative baselines in various environments, including Maze, predator-prey, MuJoCo, and Google Football.

Chat is not available.