Guided Flow Policy: Learning from High-Value Actions in Offline Reinforcement Learning

Offline reinforcement learning often relies on behavior regularization, which constrains the policy to remain close to the dataset distribution. However, the regularization terms in such approaches do not distinguish between high-value and low-value actions. We introduce Guided Flow Policy (GFP), which couples a multi-step flow-matching policy with a distilled one-step actor. The actor guides the flow policy through weighted behavior cloning, so that it clones high-value actions from the dataset rather than indiscriminately imitating all state-action pairs. In turn, the flow policy constrains the actor to remain aligned with the dataset's best transitions while maximizing the critic. This mutual guidance enables GFP to achieve state-of-the-art performance across 144 state-based and pixel-based tasks from the OGBench, Minari, and D4RL benchmarks, with substantial gains on suboptimal datasets and challenging tasks. Webpage: https://simple-robotics.github.io/publications/guided-flow-policy/
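The idea of weighted behavior cloning applied to a flow-matching policy can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the probability path or the weighting rule, so the linear noise-to-action path, the exponentiated value-gap weights, and every name below (`weighted_flow_matching_loss`, `exp_advantage_weights`, `velocity_fn`, `temperature`, `max_weight`) are assumptions made for illustration only.

```python
import numpy as np


def weighted_flow_matching_loss(velocity_fn, states, actions, weights, rng):
    """Per-sample weighted flow-matching loss (illustrative sketch).

    Assumes a straight-line path from Gaussian noise to the dataset action;
    the path used by GFP is not stated in the abstract.
    """
    batch, act_dim = actions.shape
    noise = rng.standard_normal((batch, act_dim))  # x_0 ~ N(0, I)
    t = rng.uniform(size=(batch, 1))               # t ~ U[0, 1)
    x_t = (1.0 - t) * noise + t * actions          # point on the straight path
    target = actions - noise                       # velocity of that path
    pred = velocity_fn(states, x_t, t)
    per_sample = np.sum((pred - target) ** 2, axis=1)
    # Weights near zero switch off cloning of the corresponding actions.
    return float(np.mean(weights * per_sample))


def exp_advantage_weights(q_data, q_actor, temperature=1.0, max_weight=100.0):
    """Hypothetical weighting rule: larger when the dataset action scores
    higher under the critic than the one-step actor's action."""
    return np.minimum(np.exp((q_data - q_actor) / temperature), max_weight)
```

With uniform weights the loss reduces to ordinary (unweighted) flow-matching behavior cloning; non-uniform weights shift the cloning target toward actions that the critic rates highly, which is the qualitative behavior the abstract describes.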

