Adversarial Policies Beat Professional-Level Go AIs
Tony Wang · Adam Gleave · Nora Belrose · Tom Tseng · Michael Dennis · Yawen Duan · Viktor Pogrebniak · Joseph Miller · Sergey Levine · Stuart Russell

We attack the state-of-the-art Go-playing AI system, KataGo, by training an adversarial policy that plays against a frozen KataGo victim. Our attack achieves a >99% win rate against KataGo without search, and a >50% win rate when KataGo uses enough search to be near-superhuman. To the best of our knowledge, this is the first successful end-to-end attack against a Go AI playing at the level of a top human professional. Notably, the adversary does not win by learning to play Go better than KataGo; in fact, the adversary is easily beaten by human amateurs. Instead, the adversary wins by tricking KataGo into ending the game prematurely at a point that is favorable to the adversary. Our results demonstrate that even professional-level AI systems may harbor surprising failure modes, and that AI systems which are normally superhuman may still be less robust than humans. Example games are available at https://goattack.alignmentfund.org/

Author Information

Tony Wang (MIT)
Adam Gleave (UC Berkeley)
Nora Belrose (Fund for Alignment Research)
Tom Tseng (FAR AI)
Michael Dennis (University of California Berkeley)

Michael Dennis is a 5th-year grad student at the Center for Human-Compatible AI. With a background in theoretical computer science, he is working to close the gap between decision-theoretic and game-theoretic recommendations and the current state-of-the-art approaches to robust RL and multi-agent RL. The overall aim of this work is to ensure that our systems behave in a way that is robustly beneficial. In the single-agent setting, this means making decisions and managing risk in the way the designer intends. In the multi-agent setting, this means ensuring that the concerns of the designer and those of others in the society are fairly and justly negotiated to the benefit of all involved.

Yawen Duan (University of Cambridge)
Viktor Pogrebniak (FAR AI)
Joseph Miller (FAR AI)
Sergey Levine (UC Berkeley)
Stuart Russell (UC Berkeley)

More from the Same Authors