The robust regularized Markov decision process (RRMDP) framework was proposed to learn policies robust to dynamics shifts by adding a regularization term on the transition dynamics to the value function. Existing methods mostly use unstructured regularization, potentially yielding overly conservative policies that hedge against unrealistic transitions. To address this limitation, we propose a novel framework, the $d$-rectangular linear RRMDP ($d$-RRMDP), which introduces latent structure into both the transition kernels and the regularization. We focus on offline reinforcement learning, where an agent learns policies from a dataset precollected in the nominal environment. We develop the Robust Regularized Pessimistic Value Iteration (R2PVI) algorithm, which employs linear function approximation for robust policy learning in $d$-RRMDPs with $f$-divergence based regularization terms on the transition kernels. We provide instance-dependent upper bounds on the suboptimality gap of R2PVI policies, showing that these bounds are governed by how well the dataset covers the state-action space visited by the optimal robust policy under robustly admissible transitions. We establish information-theoretic lower bounds to verify that R2PVI is near-optimal. Finally, numerical experiments validate that R2PVI learns robust policies and exhibits superior computational efficiency compared to baseline methods.
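As a minimal sketch of the regularization described above (the notation here is illustrative and not fixed by the abstract: $P^0_h$ denotes the nominal transition kernel at step $h$, $\lambda > 0$ the regularization weight, $D_f$ an $f$-divergence, and $H$ the horizon), a regularized robust Bellman backup of this flavor can be written as
$$
V^\pi_h(s) \;=\; \mathbb{E}_{a \sim \pi_h(\cdot\mid s)}\!\left[\, r_h(s,a) \;+\; \inf_{P_h(\cdot\mid s,a)} \Big\{ \mathbb{E}_{s' \sim P_h(\cdot\mid s,a)}\big[V^\pi_{h+1}(s')\big] \;+\; \lambda\, D_f\!\big(P_h(\cdot\mid s,a)\,\big\|\,P^0_h(\cdot\mid s,a)\big) \Big\} \right],
\qquad V^\pi_{H+1} \equiv 0,
$$
so that deviations of the chosen kernel $P_h$ from the nominal kernel $P^0_h$ are penalized rather than constrained to a fixed uncertainty set.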