Vision-Language-Action (VLA) models trained via imitation learning suffer significant performance degradation in data-scarce scenarios due to their reliance on large-scale demonstration datasets. Although reinforcement learning (RL)-based post-training has proven effective in addressing data scarcity, its application to VLA models is hindered by the non-resettable nature of real-world environments. This limitation is particularly critical in high-risk domains such as industrial automation, where interactions often induce state changes that are costly or infeasible to revert. Furthermore, existing VLA approaches lack a reliable mechanism for detecting task completion, leading to redundant actions that reduce overall task success rates. To address these challenges, we propose World-Env, an RL-based post-training framework that replaces physical interaction with a low-cost, world-model-based virtual simulator. World-Env consists of two key components: (1) a video-based world simulator that generates temporally consistent future visual observations, and (2) a vision-language model (VLM)-guided instant reflector that provides continuous reward signals and predicts action termination. This simulated environment enables VLA models to safely explore and generalize beyond their initial imitation-learning distribution. Our method achieves notable performance gains with as few as five expert demonstrations per task. Experiments on complex robotic manipulation tasks demonstrate that World-Env effectively overcomes the data inefficiency, safety risks, and redundant execution that afflict conventional VLA models relying on real-world interaction, offering a practical and scalable solution for post-training in resource-constrained settings. Our code is available at https://github.com/amap-cvlab/world-env.