Reinforcement learning (RL) has shown great success in estimating sequential treatment strategies that account for patient heterogeneity. However, health-outcome information, which serves as the reward for RL methods, is often not well coded but rather embedded in clinical notes. Extracting precise outcome information is a resource-intensive task, so most of the available well-annotated cohorts are small. To address this issue, we propose a semi-supervised learning (SSL) approach that efficiently leverages a small labeled dataset with the true outcome observed and a large unlabeled dataset with outcome surrogates. In particular, we propose a semi-supervised, efficient approach to Q-learning and doubly robust off-policy value estimation. Generalizing SSL to sequential treatment regimes brings interesting challenges: 1) the feature distribution for Q-learning is unknown, as it includes previous outcomes; 2) the surrogate variables we leverage in the modified SSL framework are predictive of the outcome but not informative for the optimal policy or value function. We provide theoretical results for our Q-function and value function estimators to characterize the degree to which efficiency can be gained from SSL. Our method is at least as efficient as the supervised approach and, moreover, safe, as it is robust to mis-specification of the imputation models.
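To make the semi-supervised idea above concrete, the following is a minimal illustrative sketch (not the authors' estimator) of the two-step pattern in a single-stage setting with simulated data and hypothetical variable names: an imputation model trained on the small labeled set predicts the missing outcomes from covariates, treatment, and a surrogate, and a Q-function is then fit on the combined labeled and imputed data. The sequential (multi-stage) extension and the doubly robust value estimator described in the abstract are not shown.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Hypothetical single-stage data: X = covariates, A = binary treatment,
# S = outcome surrogate (e.g., extracted from notes), Y = true outcome.
n_lab, n_unlab, p = 200, 2000, 5
n = n_lab + n_unlab
X = rng.normal(size=(n, p))
A = rng.integers(0, 2, size=n)
Y_full = X[:, 0] + A * (0.5 + X[:, 1]) + rng.normal(size=n)
S = Y_full + rng.normal(scale=0.5, size=n)     # noisy surrogate of the outcome
labeled = np.arange(n_lab)                      # true outcome observed only here
unlabeled = np.arange(n_lab, n)

# Step 1: imputation model for Y given (X, A, S), fit on the labeled set only.
Z = np.column_stack([X, A, S])
imp = LinearRegression().fit(Z[labeled], Y_full[labeled])
Y_work = Y_full.copy()
Y_work[unlabeled] = imp.predict(Z[unlabeled])   # impute outcomes on the unlabeled set

# Step 2: Q-function regression on labeled + imputed outcomes
# (a simple linear working model with a treatment-by-covariate interaction).
design = np.column_stack([X, A, A * X[:, 1]])
q_model = LinearRegression().fit(design, Y_work)

def q_value(x, a):
    """Predicted Q(x, a) under the fitted working model."""
    a_col = np.full(len(x), a)
    return q_model.predict(np.column_stack([x, a_col, a_col * x[:, 1]]))

# Estimated single-stage rule: treat whenever Q(x, 1) > Q(x, 0).
pi_hat = (q_value(X, 1) > q_value(X, 0)).astype(int)
print("estimated treatment rate:", pi_hat.mean())
```

The safety property stated in the abstract refers to the fact that even if such an imputation model is mis-specified, the resulting estimators remain valid and no less efficient than fitting on the labeled data alone.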