Gemini Robotics 1.5：通过高级具身推理、思维与运动迁移推动通用机器人前沿发展 (Gemini Robotics 1.5: Pushing the Frontier of Generalist Robots with Advanced Embodied Reasoning, Thinking, and Motion Transfer)

Gemini Robotics Team,Abbas Abdolmaleki,Saminda Abeyruwan,Joshua Ainslie,Jean-Baptiste Alayrac,Montserrat Gonzalez Arenas,Ashwin Balakrishna,Nathan Batchelor,Alex Bewley,Jeff Bingham,Michael Bloesch,Konstantinos Bousmalis,Philemon Brakel,Anthony Brohan,Thomas Buschmann,Arunkumar Byravan,Serkan Cabi,Ken Caluwaerts,Federico Casarini,Christine Chan,Oscar Chang,London Chappellet-Volpini,Jose Enrique Chen,Xi Chen,Hao-Tien Lewis Chiang,Krzysztof Choromanski,Adrian Collister,David B. D'Ambrosio,Sudeep Dasari,Todor Davchev,Meet Kirankumar Dave,Coline Devin,Norman Di Palo,Tianli Ding,Carl Doersch,Adil Dostmohamed,Yilun Du,Debidatta Dwibedi,Sathish Thoppay Egambaram,Michael Elabd,Tom Erez,Xiaolin Fang,Claudio Fantacci,Cody Fong,Erik Frey,Chuyuan Fu,Ruiqi Gao,Marissa Giustina,Keerthana Gopalakrishnan,Laura Graesser,Oliver Groth,Agrim Gupta,Roland Hafner,Steven Hansen,Leonard Hasenclever,Sam Haves,Nicolas Heess,Brandon Hernaez,Alex Hofer,Jasmine Hsu,Lu Huang,Sandy H. Huang,Atil Iscen,Mithun George Jacob,Deepali Jain,Sally Jesmonth,Abhishek Jindal,Ryan Julian,Dmitry Kalashnikov,M. Emre Karagozler,Stefani Karp,Matija Kecman,J. Chase Kew,Donnie Kim,Frank Kim,Junkyung Kim,Thomas Kipf,Sean Kirmani,Ksenia Konyushkova,Li Yang Ku,Yuheng Kuang,Thomas Lampe,Antoine Laurens,Tuan Anh Le,Isabel Leal,Alex X. Lee,Tsang-Wei Edward Lee,Guy Lever,Jacky Liang,Li-Heng Lin,Fangchen Liu,Shangbang Long,Caden Lu,Sharath Maddineni,Anirudha Majumdar,Kevis-Kokitsi Maninis,Andrew Marmon,Sergio Martinez,Assaf Hurwitz Michaely,Niko Milonopoulos,Joss Moore,Robert Moreno,Michael Neunert,Francesco Nori,Joy Ortiz,Kenneth Oslund,Carolina Parada,Emilio Parisotto,Amaris Paryag,Acorn Pooley,Thomas Power,Alessio Quaglino,Haroon Qureshi,Rajkumar Vasudeva Raju,Helen Ran,Dushyant Rao,Kanishka Rao,Isaac Reid,David Rendleman,Krista Reymann,Miguel Rivas,Francesco Romano,Yulia Rubanova,Peter Pastor Sampedro,Pannag R Sanketi,Dhruv Shah,Mohit Sharma,Kathryn Shea,Mohit Shridhar,Charles Shu,Vikas Sindhwani,Sumeet Singh,Radu Soricut,Rachel Sterneck,Ian Storz,Razvan Surdulescu,Jie Tan,Jonathan Tompson,Saran Tunyasuvunakool,Jake Varley,Grace Vesom,Giulia Vezzani,Maria Bauza Villalonga,Oriol Vinyals,René Wagner,Ayzaan Wahid,Stefan Welker,Paul Wohlhart,Chengda Wu,Markus Wulfmeier,Fei Xia,Ted Xiao,Annie Xie,Jinyu Xie,Peng Xu,Sichun Xu,Ying Xu,Zhuo Xu,Jimmy Yan,Sherry Yang,Skye Yang,Yuxiang Yang,Hiu Hong Yu,Wenhao Yu,Wentao Yuan,Yuan Yuan,Jingwei Zhang,Tingnan Zhang,Zhiyuan Zhang,Allan Zhou,Guangyao Zhou,Yuxiang Zhou

General-purpose robots need a deep understanding of the physical world, advanced reasoning, and general and dexterous control. This report introduces the latest generation of the Gemini Robotics model family: Gemini Robotics 1.5, a multi-embodiment Vision-Language-Action (VLA) model, and Gemini Robotics-ER 1.5, a state-of-the-art Embodied Reasoning (ER) model. We are bringing together three major innovations. First, Gemini Robotics 1.5 features a novel architecture and a Motion Transfer (MT) mechanism, which enables it to learn from heterogeneous, multi-embodiment robot data and makes the VLA more general. Second, Gemini Robotics 1.5 interleaves actions with a multi-level internal reasoning process in natural language. This enables the robot to "think before acting" and notably improves its ability to decompose and execute complex, multi-step tasks, and also makes the robot's behavior more interpretable to the user. Third, Gemini Robotics-ER 1.5 establishes a new state-of-the-art for embodied reasoning, i.e., for reasoning capabilities that are critical for robots, such as visual and spatial understanding, task planning, and progress estimation. Together, this family of models takes us a step towards an era of physical agents-enabling robots to perceive, think and then act so they can solve complex multi-step tasks.

翻译：通用机器人需要对物理世界有深刻理解、具备高级推理能力以及通用且灵巧的控制能力。本报告介绍了Gemini Robotics模型家族的最新代次：Gemini Robotics 1.5——一个多具身的视觉-语言-动作（VLA）模型，以及Gemini Robotics-ER 1.5——一个最先进的具身推理（ER）模型。我们汇集了三大创新。首先，Gemini Robotics 1.5采用了一种新颖的架构和运动迁移（MT）机制，使其能够从异构、多具身的机器人数据中学习，从而使VLA模型更具通用性。其次，Gemini Robotics 1.5将动作与多层次的内部自然语言推理过程交织在一起。这使得机器人能够“先思考后行动”，显著提升了其分解和执行复杂多步骤任务的能力，同时也使机器人的行为对用户而言更具可解释性。第三，Gemini Robotics-ER 1.5为具身推理确立了新的最先进水平，即对于机器人至关重要的推理能力，如视觉与空间理解、任务规划和进度评估。总之，这一模型家族使我们朝着物理智能体时代迈进了一步——使机器人能够感知、思考然后行动，从而解决复杂的多步骤任务。

相关内容

Gemini

关注 11

2023年12 月 6 日，谷歌 CEO 桑达尔・皮查伊官宣 Gemini 1.0 版正式上线。这次发布的 Gemini 大模型是原生多模态大模型，是谷歌大模型新时代的第一步，它包括三种量级：能力最强的 Gemini Ultra，适用于多任务的 Gemini Pro 以及适用于特定任务和端侧的 Gemini Nano。

ChatGPT 背后的“功臣”——RLHF 技术详解

专知会员服务

169+阅读 · 2023年2月21日

O’Reilly报告：知识图谱崛起——面向现代数据集成和数据结构体系，“The Rise of the Knowledge Graph——Toward Modern Data Integration and the Data Fabric Architecture”

专知会员服务

48+阅读 · 2022年2月18日

语言视觉预训练语言模型揭密，Behind the Scene: Revealing the Secrets of Pre-trained Vision-and-Language Models

专知会员服务

36+阅读 · 2020年5月20日

【AAAI2020论文】关注实体以更好地理解文本（Attending to Entities for Better Text Understanding）

专知会员服务

25+阅读 · 2019年11月15日