No-regret learning dynamics play a central role in game theory, enabling decentralized convergence to equilibrium concepts such as Coarse Correlated Equilibrium (CCE) and Correlated Equilibrium (CE). In this work, we improve the convergence rate to CCE in general-sum Markov games, reducing it from the previously best-known rate of $\mathcal{O}(\log^5 T / T)$ to a sharper $\mathcal{O}(\log T / T)$. This matches the best known convergence rate for CE in terms of the number of iterations $T$, while also improving the dependence on the action set size from polynomial to polylogarithmic, yielding exponential gains in high-dimensional settings. Our approach builds on recent advances in adaptive step-size techniques for no-regret algorithms in normal-form games and extends them to the Markovian setting via a stage-wise scheme that adjusts learning rates based on real-time feedback. We frame policy updates as an instance of Optimistic Follow-the-Regularized-Leader (OFTRL), customized for value-iteration-based learning. The resulting self-play algorithm achieves, to our knowledge, the fastest known convergence rate to CCE in Markov games.
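To make the core update concrete, the following is a minimal Python sketch of an OFTRL step with an entropy regularizer (optimistic Hedge) run in self-play on a two-player normal-form game, with a feedback-driven adaptive step size. The specific step-size rule, the function names, and the omission of the value-iteration and stage-wise layers are illustrative assumptions, not the paper's exact algorithm.

```python
import numpy as np

def oftrl_step(eta, cum_util, predicted_util):
    """One Optimistic FTRL update with entropy regularizer (optimistic Hedge):
    play x proportional to exp(eta * (cumulative utility + optimistic prediction))."""
    logits = eta * (cum_util + predicted_util)
    logits -= logits.max()                     # numerical stability
    x = np.exp(logits)
    return x / x.sum()

def selfplay_cce(A, B, T=2000):
    """Self-play with optimistic Hedge on a two-player normal-form game.
    A, B: payoff matrices of players 1 and 2 (shape n x m).
    Returns the empirical joint distribution of play, whose external regret
    guarantees imply approximate convergence to a CCE. The adaptive step-size
    rule (shrinking with accumulated prediction error) is illustrative only."""
    n, m = A.shape
    cum1, cum2 = np.zeros(n), np.zeros(m)      # cumulative utility vectors
    pred1, pred2 = np.zeros(n), np.zeros(m)    # optimistic predictions (last utilities)
    err1 = err2 = 1.0                          # accumulated squared prediction error
    avg_joint = np.zeros((n, m))               # running empirical joint distribution
    for _ in range(T):
        eta1, eta2 = 1.0 / np.sqrt(err1), 1.0 / np.sqrt(err2)
        x = oftrl_step(eta1, cum1, pred1)
        y = oftrl_step(eta2, cum2, pred2)
        u1, u2 = A @ y, B.T @ x                # realized utility vectors
        err1 += np.sum((u1 - pred1) ** 2)      # real-time feedback drives the step size
        err2 += np.sum((u2 - pred2) ** 2)
        cum1, cum2 = cum1 + u1, cum2 + u2
        pred1, pred2 = u1, u2
        avg_joint += np.outer(x, y)
    return avg_joint / T
```

In the Markov-game setting described above, an update of this form would be applied per state on top of estimated Q-values produced by a value-iteration-style backward pass, with the stage-wise scheme resetting or rescaling the learning rates between stages.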