Learning to Play No-Press Diplomacy with Best Response Policy Iteration

Thomas Anthony, Tom Eccles, Andrea Tacchetti, János Kramár, Ian Gemp, Thomas C. Hudson, Nicolas Porcel, Marc Lanctot, Julien Pérolat, Richard Everett, Roman Werpachowski, Satinder Singh, Thore Graepel, Yoram Bachrach

Recent advances in deep reinforcement learning (RL) have led to considerable progress in many two-player zero-sum games, such as Go, poker, and StarCraft. The purely adversarial nature of such games allows for conceptually simple and principled application of RL methods. However, real-world settings are many-agent, and agent interactions are complex mixtures of common-interest and competitive aspects. We consider Diplomacy, a 7-player board game designed to accentuate dilemmas resulting from many-agent interactions. It also features a large combinatorial action space and simultaneous moves, which are challenging for RL algorithms. We propose a simple yet effective approximate best response operator, designed to handle large combinatorial action spaces and simultaneous moves. We also introduce a family of policy iteration methods that approximate fictitious play. With these methods, we successfully apply RL to Diplomacy: we show that our agents convincingly outperform the previous state of the art, and game-theoretic equilibrium analysis shows that the new process yields consistent improvements.
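For readers unfamiliar with fictitious play, the idea the policy iteration methods are said to approximate, the following is a minimal sketch of the classic procedure on a toy two-player zero-sum matrix game (rock-paper-scissors). It is not the method of the paper: the function name `fictitious_play`, the `RPS` payoff matrix, and the step count are illustrative assumptions, and the exact `argmax` best response here stands in for the approximate best response operator that a game as large as Diplomacy would require.

```python
import numpy as np


def fictitious_play(payoff, steps=5000):
    """Classic fictitious play on a two-player zero-sum matrix game.

    payoff[i, j] is the row player's payoff when row plays i and column
    plays j; the column player receives the negative. (Hypothetical helper
    for illustration only.)
    """
    n_rows, n_cols = payoff.shape
    row_counts = np.zeros(n_rows)
    col_counts = np.zeros(n_cols)
    # Arbitrary initial actions so the empirical distributions are defined.
    row_counts[0] += 1
    col_counts[0] += 1
    for _ in range(steps):
        # Each player best-responds to the opponent's empirical action counts
        # (scaling counts to frequencies would not change the argmax).
        row_br = np.argmax(payoff @ col_counts)   # row player maximises
        col_br = np.argmin(row_counts @ payoff)   # column player minimises
        row_counts[row_br] += 1
        col_counts[col_br] += 1
    # Return the time-averaged (empirical) strategies of both players.
    return row_counts / row_counts.sum(), col_counts / col_counts.sum()


# Rock-paper-scissors payoffs for the row player (rows/cols: R, P, S).
RPS = np.array([[0.0, -1.0, 1.0],
                [1.0, 0.0, -1.0],
                [-1.0, 1.0, 0.0]])

row_avg, col_avg = fictitious_play(RPS)
```

In two-player zero-sum games the averaged strategies produced this way are known to converge to an equilibrium, so both `row_avg` and `col_avg` drift toward the uniform mix for rock-paper-scissors. No such guarantee carries over automatically to a 7-player general-sum game, which is why the sketch should be read only as background for the terminology.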
