We consider a general infinite-horizon heterogeneous restless multi-armed bandit (RMAB). Heterogeneity is a fundamental difficulty in many real-world systems, largely because it resists many concentration arguments. In this paper, we allow each of the $N$ arms to have its own model parameters. We show that, under a mild uniform-ergodicity assumption, a natural finite-horizon LP-update policy with randomized rounding, originally proposed for the homogeneous case, achieves an $O(\log N/\sqrt{N})$ optimality gap for the infinite-horizon time-average reward problem in fully heterogeneous RMABs. This provides strong theoretical guarantees for a well-known algorithm that performs very well in practice. The LP-update policy is a model predictive approach: it computes a decision at time $t$ by planning over the horizon $\{t,\dots,t+\tau\}$. Our simulation section demonstrates that our algorithm performs extremely well even for small values of $\tau$, such as $\tau=5$, which makes it computationally efficient. Our theoretical results draw on techniques from the model predictive control literature, in particular the notion of \emph{dissipativity}, and generalize readily to the more general setting of heterogeneous weakly coupled Markov decision processes. In addition, we draw a parallel between our policy and the LP-index policy by showing that the LP-index policy corresponds to the case $\tau=1$; we explain where the latter's shortcomings arise and how, under our mild assumption, we are able to address them. The proof of our main theorem answers an open problem posed by Brown et al. (2020), paving the way for several new questions on LP-update policies.
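To make the policy concrete, the following is a minimal sketch of a single LP-update step with randomized rounding, not the paper's exact implementation: the random instance, the \texttt{cvxpy}-based LP formulation, and the greedy budget-repair rule after the Bernoulli draws are all illustrative assumptions. In a full LP-update (receding-horizon) implementation, this step would be re-solved at every time $t$ from the newly observed states, and setting $\tau=1$ recovers an LP-index-style one-step decision.

\begin{verbatim}
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
N, S, tau, alpha = 4, 3, 5, 0.5          # arms, states, horizon, budget fraction

# Fully heterogeneous instance: arm n has its own kernels P[n][a] and rewards r[n]
P = [[rng.random((S, S)) for _ in range(2)] for _ in range(N)]
P = [[Pa / Pa.sum(axis=1, keepdims=True) for Pa in Pn] for Pn in P]
r = [rng.random((S, 2)) for _ in range(N)]
state = rng.integers(0, S, size=N)        # current state of each arm

# LP variables: y[n][t] has entry (s, a) = occupation measure of arm n at step t
y = [[cp.Variable((S, 2), nonneg=True) for _ in range(tau)] for _ in range(N)]

cons = []
for n in range(N):
    # step 0: arm n sits deterministically in its observed state
    cons.append(cp.sum(y[n][0], axis=1) == np.eye(S)[state[n]])
    # flow conservation under arm n's own (heterogeneous) dynamics
    for t in range(tau - 1):
        inflow = sum(y[n][t][:, a] @ P[n][a] for a in range(2))
        cons.append(cp.sum(y[n][t + 1], axis=1) == inflow)
for t in range(tau):
    # relaxed budget: alpha * N arms are activated in expectation at each step
    cons.append(sum(cp.sum(y[n][t][:, 1]) for n in range(N)) == alpha * N)

obj = cp.Maximize(sum(cp.sum(cp.multiply(y[n][t], r[n]))
                      for n in range(N) for t in range(tau)))
cp.Problem(obj, cons).solve()

# Randomized rounding: y[n][0][state[n], 1] is arm n's activation probability.
probs = np.array([y[n][0].value[state[n], 1] for n in range(N)]).clip(0, 1)
act = rng.random(N) < probs
# Simple repair rule (an assumption, not the paper's exact rounding): if the
# draw violates the hard budget, keep the arms with the largest probabilities.
budget = round(alpha * N)
if act.sum() != budget:
    act = np.zeros(N, dtype=bool)
    act[np.argsort(-probs)[:budget]] = True
print("arms to activate at time t:", np.flatnonzero(act))
\end{verbatim}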