Robots that learn manipulation skills from everyday human videos could acquire broad capabilities without tedious robot data collection. We propose a video-to-video translation framework that converts ordinary human-object interaction videos into motion-consistent robot manipulation videos with realistic, physically grounded interactions. Our approach requires no paired human-robot videos for training, only a set of unpaired robot videos, making the system easy to scale. We introduce a transferable representation that bridges the embodiment gap: by inpainting the robot arm out of training videos to obtain a clean background and overlaying a simple visual cue (a marker and arrow indicating the gripper's position and orientation), we can condition a generative model to insert the robot arm back into the scene. At test time, we apply the same process to human videos (inpainting the person and overlaying human pose cues) and generate high-quality robot videos that mimic the human's actions. We fine-tune a state-of-the-art video diffusion model (Wan 2.2) in an in-context learning manner to ensure temporal coherence and to leverage its rich prior knowledge. Empirical results demonstrate that our approach produces significantly more realistic and physically grounded robot motions than baselines, pointing to a promising direction for scaling up robot learning from unlabeled human videos. Project page: https://showlab.github.io/H2R-Grounder/
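To make the transferable representation concrete, here is a minimal sketch of the kind of visual cue the abstract describes: a marker at the gripper's 2D position plus an arrow for its orientation, drawn onto an inpainted (arm- or person-free) frame. This is an illustrative assumption, not the authors' code; the function name, inputs, and OpenCV-based rendering are hypothetical stand-ins, and in the actual pipeline the position and orientation would come from robot state (training) or an estimated human hand pose (test time).

```python
# Illustrative sketch only: overlaying a gripper cue (marker + orientation arrow)
# on an inpainted background frame, as described in the abstract.
import cv2
import numpy as np

def overlay_gripper_cue(frame: np.ndarray,
                        center_xy: tuple,
                        angle_rad: float,
                        arrow_len: int = 60) -> np.ndarray:
    """Draw a filled circle at the (hypothetical) gripper position and an arrow
    indicating its in-plane orientation. The conditioned generative model would
    then insert the robot arm consistent with this cue."""
    out = frame.copy()
    cx, cy = int(center_xy[0]), int(center_xy[1])
    tip = (int(cx + arrow_len * np.cos(angle_rad)),
           int(cy + arrow_len * np.sin(angle_rad)))
    cv2.circle(out, (cx, cy), 8, (0, 0, 255), thickness=-1)             # position marker
    cv2.arrowedLine(out, (cx, cy), tip, (0, 255, 0), 3, tipLength=0.3)  # orientation arrow
    return out

# Usage (hypothetical): condition the video generator on frames like this,
# where `inpainted_frame` has the robot arm or human removed.
# conditioned = overlay_gripper_cue(inpainted_frame, (320, 240), np.pi / 4)
```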