General-purpose robots need a deep understanding of the physical world, advanced reasoning, and general and dexterous control. This report introduces the latest generation of the Gemini Robotics model family: Gemini Robotics 1.5, a multi-embodiment Vision-Language-Action (VLA) model, and Gemini Robotics-ER 1.5, a state-of-the-art Embodied Reasoning (ER) model. We bring together three major innovations. First, Gemini Robotics 1.5 features a novel architecture and a Motion Transfer (MT) mechanism, which enables it to learn from heterogeneous, multi-embodiment robot data and makes the VLA more general. Second, Gemini Robotics 1.5 interleaves actions with a multi-level internal reasoning process in natural language. This enables the robot to "think before acting", which notably improves its ability to decompose and execute complex, multi-step tasks and makes the robot's behavior more interpretable to the user. Third, Gemini Robotics-ER 1.5 establishes a new state of the art for embodied reasoning, i.e., for reasoning capabilities that are critical for robots, such as visual and spatial understanding, task planning, and progress estimation. Together, this family of models takes us a step towards an era of physical agents, enabling robots to perceive, think, and then act so they can solve complex multi-step tasks.