Offline reinforcement learning (RL) suffers from extrapolation errors induced by out-of-distribution (OOD) actions. To address this, offline RL algorithms typically impose constraints on action selection, which can be systematically categorized into density, support, and sample constraints. However, we show that each category has inherent limitations: density and sample constraints tend to be overly conservative in many scenarios, while the support constraint, though least restrictive, faces challenges in accurately modeling the behavior policy. To overcome these limitations, we propose a new neighborhood constraint that restricts action selection in the Bellman target to the union of neighborhoods of dataset actions. Theoretically, the constraint not only bounds extrapolation errors and distribution shift under certain conditions, but also approximates the support constraint without requiring behavior policy modeling. Moreover, it retains substantial flexibility and enables pointwise conservatism by adapting the neighborhood radius for each data point. In practice, we employ data quality as the adaptation criterion and design an adaptive neighborhood constraint. Building on an efficient bilevel optimization framework, we develop a simple yet effective algorithm, Adaptive Neighborhood-constrained Q learning (ANQ), to perform Q learning with target actions satisfying this constraint. Empirically, ANQ achieves state-of-the-art performance on standard offline RL benchmarks and exhibits strong robustness in scenarios with noisy or limited data.
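As an illustrative sketch only (the notation below is assumed for exposition and may differ from the paper's formal definitions), the neighborhood constraint can be written as restricting the maximization in the Bellman target to actions lying within an adaptive radius of some dataset action:

\[
\mathcal{T}Q(s, a) \;=\; r(s, a) \;+\; \gamma \max_{a' \in \mathcal{N}(s')} Q(s', a'),
\qquad
\mathcal{N}(s') \;=\; \bigcup_{(s', a_i) \in \mathcal{D}} \bigl\{\, a' : \|a' - a_i\| \le \epsilon_i \,\bigr\},
\]

where $\mathcal{D}$ denotes the offline dataset and $\epsilon_i$ is a per-datapoint neighborhood radius, which the adaptive variant ties to data quality so that higher-quality transitions permit less conservative (larger-radius) target actions.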