Pessimistic Model-based Offline Reinforcement Learning under Partial Coverage

Comments (from arXiv): We changed the title from the first version. This is a longer version of the article accepted at ICLR 2022. We added a new algorithm, CPPO-LR, where the constraint is given in a log-likelihood form, and we show how to instantiate CPPO on (nonparametric) linear MDPs.

We study model-based offline reinforcement learning with general function approximation without a full coverage assumption on the offline data distribution. We present an algorithm named Constrained Pessimistic Policy Optimization (CPPO), which leverages a general function class and uses a constraint over the model class to encode pessimism. Under the assumption that the ground truth model belongs to our function class (i.e., realizability in the function class), CPPO has a PAC guarantee with offline data only providing partial coverage, i.e., it can learn a policy that competes against any policy that is covered by the offline data. We then demonstrate that this algorithmic framework can be applied to many specialized Markov Decision Processes where additional structural assumptions can further refine the concept of partial coverage. Two notable examples are: (1) low-rank MDPs with representation learning, where the partial coverage condition is defined using a relative condition number measured by the unknown ground truth feature representation; (2) factored MDPs, where the partial coverage condition is defined using density-ratio-based concentrability coefficients associated with individual factors.
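The idea of "a constraint over the model class to encode pessimism" can be illustrated with a small sketch. What follows is not the paper's implementation. It is a toy instance under strong assumptions: a tabular MDP with a known reward, a finite list of candidate transition models, and brute-force search over deterministic policies. The constraint keeps only models whose average log-likelihood on the offline data is within a slack `xi` of the best one, in the spirit of the log-likelihood form mentioned for CPPO-LR; the policy is then chosen to maximize its worst-case value over the surviving models. All names (`cppo_lr`, `make_model`, `xi`) and the two-state MDP are hypothetical.

```python
import itertools
import numpy as np


def policy_value(P, R, pi, gamma, d0):
    """Exact discounted value of a deterministic policy pi under model P.

    P has shape (S, A, S), R has shape (S, A), pi has shape (S,).
    """
    S = R.shape[0]
    P_pi = P[np.arange(S), pi]          # (S, S) transitions under pi
    r_pi = R[np.arange(S), pi]          # (S,) rewards under pi
    v = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)
    return float(d0 @ v)


def log_likelihood(P, data):
    """Average log-likelihood of offline transitions (s, a, s_next) under P."""
    return float(np.mean([np.log(P[s, a, s2]) for s, a, s2 in data]))


def cppo_lr(models, R, data, gamma, d0, xi):
    """Max-min policy search over a likelihood-constrained model set (toy version)."""
    lls = np.array([log_likelihood(P, data) for P in models])
    # Constraint: keep models that explain the data almost as well as the best one.
    version_space = [P for P, ll in zip(models, lls) if ll >= lls.max() - xi]
    S, A = R.shape
    best_pi, best_val = None, -np.inf
    for pi in itertools.product(range(A), repeat=S):
        pi = np.array(pi)
        # Pessimism: score each policy by its worst value over the version space.
        worst = min(policy_value(P, R, pi, gamma, d0) for P in version_space)
        if worst > best_val:
            best_pi, best_val = pi, worst
    return best_pi, best_val


def make_model(p_stay_safe, p_fall_risky):
    """Hypothetical 2-state MDP: state 0 is good, state 1 is an absorbing bad state."""
    P = np.zeros((2, 2, 2))
    P[0, 0] = [p_stay_safe, 1 - p_stay_safe]      # action 0: "safe"
    P[0, 1] = [1 - p_fall_risky, p_fall_risky]    # action 1: "risky"
    P[1, :] = [0.0, 1.0]                          # state 1 is absorbing
    return P


R = np.array([[0.5, 1.0],    # state 0: safe pays 0.5, risky pays 1.0
              [0.0, 0.0]])   # state 1 pays nothing
gamma, d0 = 0.9, np.array([1.0, 0.0])

models = [
    make_model(1.0, 0.05),   # agrees with the data, optimistic about the risky action
    make_model(1.0, 0.90),   # agrees with the data, pessimistic about the risky action
    make_model(0.1, 0.05),   # contradicts the data on the safe action
]

# Offline data covers only the safe action in state 0 (partial coverage).
data = [(0, 0, 0)] * 20

pi_hat, v_hat = cppo_lr(models, R, data, gamma, d0, xi=0.1)
print("chosen action in state 0:", pi_hat[0], "pessimistic value:", v_hat)
```

In this toy setting the data says nothing about the risky action, so the two models that fit the data equally well both survive the constraint, and the worst-case value of the risky action is low. The third model is removed because it assigns low likelihood to the observed safe transitions. With a very loose slack (for example `xi=10.0`) it is kept, and the worst-case value of the safe action collapses instead, which is why the constraint and the pessimism have to work together.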
