Near Sample-Optimal Reduction-based Policy Learning for Average Reward MDP

This work considers the sample complexity of obtaining an $\varepsilon$-optimal policy in an average reward Markov Decision Process (AMDP), given access to a generative model (simulator). When the ground-truth MDP is weakly communicating, we prove an upper bound of $\widetilde O(H \varepsilon^{-3} \ln \frac{1}{\delta})$ samples per state-action pair, where $H := \mathrm{sp}(h^*)$ is the span of the bias of any optimal policy, $\varepsilon$ is the accuracy, and $\delta$ is the failure probability. This bound improves on the best-known mixing-time-based approaches in [Jin & Sidford 2021], which assume that the mixing time of every deterministic policy is bounded. The core of our analysis is a proper reduction bound from AMDP problems to discounted MDP (DMDP) problems, which may be of independent interest since it allows DMDP algorithms to be applied to AMDPs in other settings. We complement our upper bound by proving a minimax lower bound of $\Omega(|\mathcal S| |\mathcal A| H \varepsilon^{-2} \ln \frac{1}{\delta})$ total samples, showing that a linear dependence on $H$ is necessary and that our upper bound matches the lower bound in all of the parameters $(|\mathcal S|, |\mathcal A|, H, \ln \frac{1}{\delta})$ up to logarithmic factors.
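To make the reduction idea concrete, here is a minimal sketch of solving a discounted MDP and then judging the resulting policy by its average reward. It is an illustration only, not the algorithm or the analysis of this work: the two-state MDP, its numbers, and the helper names `discounted_greedy_policy` and `average_reward` are all hypothetical, the model is assumed to be known exactly rather than estimated from generative-model samples, and the discount factors 0.3 and 0.99 are arbitrary choices (the abstract does not state how the discount factor should be set in terms of $H$ and $\varepsilon$).

```python
# Hypothetical toy MDP: P[s][a] is the next-state distribution, R[s][a] is a reward in [0, 1].
P = [
    [[1.0, 0.0], [0.1, 0.9]],  # state 0: action 0 stays put, action 1 usually moves to state 1
    [[0.1, 0.9], [1.0, 0.0]],  # state 1: action 0 usually stays, action 1 returns to state 0
]
R = [
    [0.5, 0.0],
    [1.0, 0.2],
]


def q_value(V, s, a, gamma):
    """One-step lookahead value of taking action a in state s under discount gamma."""
    return R[s][a] + gamma * sum(p * v for p, v in zip(P[s][a], V))


def discounted_greedy_policy(gamma, tol=1e-10):
    """Run value iteration on the discounted MDP and return the greedy policy."""
    V = [0.0] * len(P)
    while True:
        new_V = [max(q_value(V, s, a, gamma) for a in range(len(P[s])))
                 for s in range(len(P))]
        converged = max(abs(x - y) for x, y in zip(new_V, V)) < tol
        V = new_V
        if converged:
            break
    return [max(range(len(P[s])), key=lambda a: q_value(V, s, a, gamma))
            for s in range(len(P))]


def average_reward(policy, start=0, steps=10000):
    """Approximate the long-run average reward of a deterministic policy
    by averaging the expected per-step reward over a finite horizon."""
    dist = [0.0] * len(P)
    dist[start] = 1.0
    total = 0.0
    for _ in range(steps):
        total += sum(dist[s] * R[s][policy[s]] for s in range(len(P)))
        # Push the state distribution one step forward under the policy.
        dist = [sum(dist[s] * P[s][policy[s]][t] for s in range(len(P)))
                for t in range(len(P))]
    return total / steps


if __name__ == "__main__":
    for gamma in (0.3, 0.99):
        policy = discounted_greedy_policy(gamma)
        print(gamma, policy, round(average_reward(policy), 3))
```

In this toy instance the policy obtained with the small discount factor stays in state 0 and earns an average reward of 0.5, while the policy obtained with the discount factor close to 1 reaches an average reward of about 0.9, which is the best achievable here. This is only meant to show why a discounted solver can serve as a subroutine for the average-reward objective when the discount factor is chosen suitably.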
