Leader-follower general-sum stochastic games (LF-GSSGs) model sequential decision-making under asymmetric commitment, where a leader commits to a policy and a follower best responds, yielding a strong Stackelberg equilibrium (SSE) with leader-favourable tie-breaking. This paper introduces a dynamic programming (DP) framework that computes SSEs by applying Bellman recursion over credible sets: state abstractions that formally represent all rational follower best responses under partial leader commitments. We first prove that any LF-GSSG admits a lossless reduction to a Markov decision process (MDP) over credible sets. We further establish that synthesising an optimal memoryless deterministic leader policy is NP-hard, motivating the development of ε-optimal DP algorithms with provable guarantees on leader exploitability. Experiments on standard mixed-motive benchmarks, including security games, resource allocation, and adversarial planning, demonstrate gains in leader value and runtime scalability over state-of-the-art methods.