Markov games and robust MDPs are closely related models whose solution requires computing a pair of saddle-point policies. Among the long-standing efforts to develop efficient algorithms for these models, the Filar-Tolwinski (FT) algorithm has shown considerable promise. As our first contribution, we demonstrate that FT may fail to converge to a saddle point and can loop indefinitely, even in small games. This observation contradicts the convergence proof given in the original paper. As our second contribution, we propose Residual Conditioned Policy Iteration (RCPI). RCPI builds on FT but is guaranteed to converge to a saddle point. Our numerical results show that RCPI outperforms other convergent algorithms by several orders of magnitude.
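For reference, the saddle-point condition that both FT and RCPI target can be stated in the standard form below; the notation $v^{\pi,\sigma}$ for the value of the game when the maximizing player follows policy $\pi$ and the minimizing player follows policy $\sigma$ is our own shorthand and is not taken from the paper:
\[
v^{\pi,\sigma^\star} \;\le\; v^{\pi^\star,\sigma^\star} \;\le\; v^{\pi^\star,\sigma}
\qquad \text{for all policies } \pi,\ \sigma ,
\]
that is, neither player can improve by deviating unilaterally from the pair $(\pi^\star,\sigma^\star)$.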