Motivated by engineering applications such as resource allocation in networks and inventory systems, we consider average-reward Reinforcement Learning with unbounded state space and reward function. Recent works have studied this problem in the actor-critic framework and established finite-sample bounds assuming access to a critic with certain error guarantees. We complement their work by studying Temporal Difference (TD) learning with linear function approximation and establishing finite-time bounds with the optimal $\mathcal{O}\left(1/\epsilon^2\right)$ sample complexity. These results are obtained using the following general-purpose theorem for non-linear Stochastic Approximation (SA): suppose one constructs a Lyapunov function for a non-linear SA satisfying a certain drift condition; then, under suitable conditions, our theorem establishes finite-time bounds when this SA is driven by unbounded Markovian noise. The theorem serves as a black-box tool for generalizing sample guarantees on SA from the i.i.d. or martingale-difference setting to potentially unbounded Markovian noise. The generality and mild assumptions of the setup enable broad applicability of our theorem. We illustrate its power by studying two more systems: (i) we improve upon the finite-time bounds for $Q$-learning by tightening the error bounds and allowing for a larger class of behavior policies; (ii) we establish the first-ever finite-time bounds for distributed stochastic optimization of high-dimensional smooth strongly convex functions using cyclic block coordinate descent.
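As a concrete illustration of the setting described above, the sketch below writes out, in notation chosen here purely for exposition (the symbols $x_t$, $F$, $Y_t$, $V$, $c$, $\theta_t$, $\phi$, $\bar r_t$ are not fixed by the abstract), the kind of SA recursion and Lyapunov drift condition the theorem addresses, together with the average-reward TD(0) recursion with linear features as one instance.

\begin{align*}
  &\text{Non-linear SA driven by Markovian noise:}\quad
    x_{t+1} = x_t + \alpha_t\, F(x_t, Y_t), \\
  &\text{with a Lyapunov drift condition of the form}\quad
    \mathbb{E}\!\left[V(x_{t+1}) - V(x_t)\,\middle|\,\mathcal{F}_t\right]
    \le -c\,\alpha_t\, V(x_t) + \mathcal{O}\!\left(\alpha_t^2\right), \\
  &\text{of which average-reward TD(0) with linear features is one instance:} \\
  &\qquad \theta_{t+1} = \theta_t + \alpha_t\,
    \bigl(R(S_t) - \bar r_t + \phi(S_{t+1})^\top \theta_t - \phi(S_t)^\top \theta_t\bigr)\,\phi(S_t).
\end{align*}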