Offline-to-online reinforcement learning (O2O-RL) has emerged as a promising paradigm for safe and efficient robotic policy deployment, but it suffers from two fundamental challenges: limited coverage of multimodal behaviors and distributional shift during online adaptation. We propose UEPO, a unified generative framework inspired by the pretraining and fine-tuning strategies of large language models. Our contributions are threefold: (1) a multi-seed dynamics-aware diffusion policy that efficiently captures diverse behavioral modalities without training multiple models; (2) a dynamic divergence regularization mechanism that enforces physically meaningful policy diversity; and (3) a diffusion-based data augmentation module that enhances dynamics model generalization. On the D4RL benchmark, UEPO achieves a +5.9\% absolute improvement over Uni-O4 on locomotion tasks and a +12.4\% improvement on dexterous manipulation, demonstrating strong generalization and scalability.
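To make contribution (2) more concrete, a minimal sketch of one way a dynamic divergence regularizer could be written is given below; the objective $J$, the divergence $D$, and the schedule $\lambda_t$ are assumptions introduced here for illustration, not the formulation used in the paper.

% Illustrative sketch only: J is a standard policy-improvement objective for
% seed policy \pi_{\theta_i}, D is a policy divergence (e.g., KL), and
% \lambda_t is a weight adapted during online training; none of these
% symbols are taken from the paper itself.
\begin{equation*}
  \max_{\theta_i} \; J(\pi_{\theta_i})
  \;+\; \lambda_t \sum_{j \neq i} D\!\left(\pi_{\theta_i} \,\Vert\, \pi_{\theta_j}\right)
\end{equation*}

Under this reading, each seed policy is rewarded for remaining distinguishable from the other seeds while still improving its own return, which is one way to obtain diversity across seeds without training separate models from scratch.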