We present BLINQ, a new model-based algorithm that learns the Whittle indices of an indexable, communicating and unichain Markov Decision Process (MDP). Our approach relies on building an empirical estimate of the MDP and then computing its Whittle indices using an extended version of a state-of-the-art existing algorithm. We provide a proof of convergence to the Whittle indices we want to learn as well as a bound on the time needed to learn them with arbitrary precision. Moreover, we investigate its computational complexity. Our numerical experiments suggest that BLINQ significantly outperforms existing Q-learning approaches in terms of the number of samples needed to get an accurate approximation. In addition, it has a total computational cost even lower than Q-learning for any reasonably high number of samples. These observations persist even when the Q-learning algorithms are speeded up using pre-trained neural networks to predict Q-values.
翻译:本文提出BLINQ,一种新的基于模型的算法,用于学习可索引、可通信且单链马尔可夫决策过程(MDP)的Whittle指数。我们的方法依赖于构建MDP的经验估计,然后使用现有先进算法的扩展版本计算其Whittle指数。我们提供了收敛到待学习Whittle指数的证明,以及以任意精度学习这些指数所需时间的界限。此外,我们研究了其计算复杂度。数值实验表明,在获得精确近似所需样本数方面,BLINQ显著优于现有的Q学习方法。同时,对于任何合理的大样本量,其总计算成本甚至低于Q学习。即使Q学习算法通过使用预训练神经网络预测Q值进行加速,这些观察结果仍然成立。