Double Reinforcement Learning (DRL) enables efficient inference for policy values in nonparametric Markov decision processes (MDPs), but existing methods face two major obstacles: (1) they require stringent intertemporal overlap conditions on state trajectories, and (2) they rely on estimating high-dimensional occupancy density ratios. Motivated by problems in long-term causal inference, we extend DRL to a semiparametric setting and develop doubly robust, automatic estimators for general linear functionals of the Q-function in infinite-horizon, time-homogeneous MDPs. By imposing structure on the Q-function, we relax the overlap conditions required by nonparametric methods and obtain efficiency gains. The second obstacle--density-ratio estimation--typically requires computationally expensive and unstable min-max optimization. To address both challenges, we introduce superefficient nonparametric estimators whose limiting variance falls below the generalized Cramér-Rao bound. These estimators treat the Q-function as a one-dimensional summary of the state-action process, reducing high-dimensional overlap requirements to a one-dimensional condition. The procedure is simple to implement: estimate and calibrate the Q-function using fitted Q-iteration, then plug the result into the target functional, thereby avoiding density-ratio estimation altogether.
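To make the final sentence concrete, the following is a minimal sketch, under an assumed discounted-reward setup, of the plug-in procedure: fitted Q-iteration for the evaluation policy, followed by averaging the estimated Q-function over initial states to evaluate the policy-value functional. All names here (`pi_e`, `gamma`, the `(S, A, R, S_next)` data layout, the choice of `GradientBoostingRegressor` as the regression routine) are illustrative assumptions rather than the paper's implementation, and the calibration step mentioned above is omitted.

```python
# Sketch of the plug-in procedure (illustrative, not the paper's code):
# (1) estimate the Q-function of the evaluation policy pi_e by fitted Q-iteration,
# (2) plug it into a linear functional -- here the policy value
#     E_{s0 ~ nu0}[Q(s0, pi_e(s0))]. The calibration step is omitted.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor


def fitted_q_iteration(S, A, R, S_next, pi_e, gamma=0.95, n_iters=50):
    """Fitted Q-iteration for policy evaluation.

    S, S_next: (n, d_s) state arrays; A: (n, d_a) action array; R: (n,) rewards.
    pi_e: maps a state vector to a (d_a,) action vector (evaluation policy).
    """
    X = np.hstack([S, A])
    q_model = None
    for _ in range(n_iters):
        if q_model is None:
            target = R  # first pass: Q approximated by the immediate reward
        else:
            A_next = np.array([pi_e(s) for s in S_next])
            X_next = np.hstack([S_next, A_next])
            target = R + gamma * q_model.predict(X_next)  # Bellman backup under pi_e
        q_model = GradientBoostingRegressor().fit(X, target)
    return q_model


def plug_in_value(q_model, initial_states, pi_e):
    """Plug-in estimate of the policy value, a linear functional of Q."""
    A0 = np.array([pi_e(s) for s in initial_states])
    return q_model.predict(np.hstack([initial_states, A0])).mean()
```

No occupancy density ratio is estimated anywhere in this sketch: the only learned object is the Q-function, which is then averaged over sampled initial states.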