Data scaling has long been a critical bottleneck in robot learning. For humanoid robots, human videos and motion data are abundant and widely available, offering a free, large-scale data source. Moreover, the semantics associated with these motions enable modality alignment and the learning of high-level robot control. However, how to effectively mine raw video, extract robot-learnable representations, and leverage them for scalable learning remains an open problem. To address this, we introduce Humanoid-Union, a large-scale dataset generated through an autonomous pipeline, comprising over 260 hours of diverse, high-quality humanoid robot motion data with semantic annotations derived from human motion videos. The dataset can be further expanded via the same pipeline. Building on this data resource, we propose SCHUR, a scalable learning framework designed to explore the impact of large-scale data on high-level control in humanoid robots. Experimental results demonstrate that SCHUR achieves high-quality robot motion generation and strong text-motion alignment under data and model scaling, improving reconstruction by 37\% in MPJPE and text-motion alignment by 25\% in FID compared with previous methods. Its effectiveness is further validated through deployment on a real-world humanoid robot.
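For reference, the two metrics cited above are assumed here to follow their conventional formulations; the notation below ($T$ frames, $J$ joints, $\hat{\mathbf{p}}_{t,j}$ for predicted joint positions, and $(\boldsymbol{\mu}, \Sigma)$ for feature statistics of generated and real motions) is illustrative and not necessarily the paper's own:
\[
\mathrm{MPJPE} = \frac{1}{T}\sum_{t=1}^{T}\frac{1}{J}\sum_{j=1}^{J}\left\lVert \hat{\mathbf{p}}_{t,j} - \mathbf{p}_{t,j} \right\rVert_2,
\qquad
\mathrm{FID} = \left\lVert \boldsymbol{\mu}_g - \boldsymbol{\mu}_r \right\rVert_2^2 + \operatorname{Tr}\!\left(\Sigma_g + \Sigma_r - 2\left(\Sigma_g \Sigma_r\right)^{1/2}\right).
\]
Lower is better for both: MPJPE measures per-joint reconstruction error in position space, while FID compares the distribution of generated motions against real ones in a learned feature space.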