Despite the rise of billion-parameter foundation models trained across thousands of GPUs, similar scaling gains have not been shown for humanoid control. Current neural controllers for humanoids remain modest in size, target a limited set of behaviors, and are trained on a handful of GPUs over several days. We show that scaling up model capacity, data, and compute yields a generalist humanoid controller capable of producing natural and robust whole-body movements. Specifically, we position motion tracking as a natural and scalable task for humanoid control, leveraging dense supervision from diverse motion-capture data to acquire human motion priors without manual reward engineering. We build a foundation model for motion tracking by scaling along three axes: network size (from 1.2M to 42M parameters), dataset volume (over 100M frames, 700 hours of high-quality motion data), and compute (9k GPU hours). Beyond demonstrating the benefits of scale, we show the practical utility of our model through two mechanisms: (1) a real-time universal kinematic planner that bridges motion tracking to downstream task execution, enabling natural and interactive control, and (2) a unified token space that supports various motion input interfaces, such as VR teleoperation devices, human videos, and vision-language-action (VLA) models, all using the same policy. Scaling motion tracking exhibits favorable properties: performance improves steadily with increased compute and data diversity, and the learned representations generalize to unseen motions. Together, these results establish motion tracking at scale as a practical foundation for humanoid control.