Recent progress in large-scale robotic datasets and vision-language models (VLMs) has advanced research on vision-language-action (VLA) models. However, existing VLA models still face two fundamental challenges: (i) producing precise low-level actions from high-dimensional observations, and (ii) bridging domain gaps across heterogeneous data sources, including diverse robot embodiments and human demonstrations. Existing methods often encode latent variables from either visual dynamics or robotic actions to guide policy learning, but they fail to fully exploit the complementary multimodal knowledge present in large-scale, heterogeneous datasets. In this work, we present X Robotic Model 1 (XR-1), a novel framework for versatile and scalable VLA learning across diverse robots, tasks, and environments. XR-1 introduces \emph{Unified Vision-Motion Codes (UVMC)}, a discrete latent representation learned via a dual-branch VQ-VAE that jointly encodes visual dynamics and robotic motion. UVMC addresses these challenges by (i) serving as an intermediate representation between observations and actions, and (ii) aligning multimodal dynamic information from heterogeneous data sources to capture complementary knowledge. To effectively exploit UVMC, we propose a three-stage training paradigm: (i) self-supervised UVMC learning, (ii) UVMC-guided pretraining on large-scale cross-embodiment robotic datasets, and (iii) task-specific post-training. We validate XR-1 through extensive real-world experiments with more than 14,000 rollouts on six different robot embodiments, spanning over 120 diverse manipulation tasks. XR-1 consistently outperforms state-of-the-art baselines such as $\pi_{0.5}$, $\pi_0$, RDT, UniVLA, and GR00T-N1.5, while demonstrating strong generalization to novel objects, background variations, distractors, and illumination changes. Our project page is available at https://xr-1-vla.github.io/.
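To make the UVMC idea concrete, the sketch below shows one way a dual-branch VQ-VAE could jointly quantize visual dynamics and robot motion into a shared discrete codebook. This is a minimal PyTorch illustration; the encoder architectures, codebook size, shared-codebook choice, and alignment loss are assumptions for exposition, not the paper's actual design.

```python
# Hypothetical sketch of a dual-branch VQ-VAE for unified vision-motion codes.
# All dimensions, the shared codebook, and the alignment loss are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class VectorQuantizer(nn.Module):
    """Nearest-neighbour quantization with a straight-through estimator."""
    def __init__(self, num_codes=512, code_dim=256, beta=0.25):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, code_dim)
        self.codebook.weight.data.uniform_(-1.0 / num_codes, 1.0 / num_codes)
        self.beta = beta

    def forward(self, z):                         # z: (B, code_dim)
        d = torch.cdist(z, self.codebook.weight)  # distances to all codes
        idx = d.argmin(dim=-1)                    # nearest code index per sample
        z_q = self.codebook(idx)
        # codebook + commitment losses, straight-through gradient to the encoder
        loss = F.mse_loss(z_q, z.detach()) + self.beta * F.mse_loss(z, z_q.detach())
        z_q = z + (z_q - z).detach()
        return z_q, idx, loss


class DualBranchVQVAE(nn.Module):
    """Two encoders (visual dynamics, robot motion) quantized by a shared codebook."""
    def __init__(self, obs_dim=2048, act_dim=7, horizon=16, code_dim=256):
        super().__init__()
        # vision branch: features of consecutive observations -> latent
        self.vis_enc = nn.Sequential(nn.Linear(2 * obs_dim, 512), nn.GELU(),
                                     nn.Linear(512, code_dim))
        # motion branch: a chunk of low-level actions -> latent
        self.mot_enc = nn.Sequential(nn.Linear(horizon * act_dim, 512), nn.GELU(),
                                     nn.Linear(512, code_dim))
        self.quant = VectorQuantizer(code_dim=code_dim)    # shared codebook (assumption)
        self.vis_dec = nn.Linear(code_dim, obs_dim)        # reconstruct next observation
        self.mot_dec = nn.Linear(code_dim, horizon * act_dim)

    def forward(self, obs_t, obs_t1, actions):
        z_v = self.vis_enc(torch.cat([obs_t, obs_t1], dim=-1))
        z_m = self.mot_enc(actions.flatten(1))
        q_v, _, l_v = self.quant(z_v)
        q_m, _, l_m = self.quant(z_m)
        recon = (F.mse_loss(self.vis_dec(q_v), obs_t1) +
                 F.mse_loss(self.mot_dec(q_m), actions.flatten(1)))
        align = F.mse_loss(z_v, z_m.detach())   # pull the two branches together (assumption)
        return recon + l_v + l_m + align
```

In such a setup, the quantized codes produced by the two branches would serve as the intermediate targets that guide policy learning during pretraining, which is the role the abstract attributes to UVMC; human demonstrations without action labels could still supply supervision through the vision branch.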