Embodied intelligence systems, which enhance agent capabilities through continuous environment interaction, have garnered significant attention from both academia and industry. Vision-Language-Action (VLA) models, inspired by advances in large foundation models, serve as universal robotic control frameworks that substantially improve agent-environment interaction in embodied intelligence systems, thereby broadening the application scenarios of embodied AI robots. This survey comprehensively reviews VLA models for embodied manipulation. First, we chronicle the developmental trajectory of VLA architectures. We then conduct a detailed analysis of current research across five critical dimensions: VLA model structures, training datasets, pre-training methods, post-training methods, and model evaluation. Finally, we synthesize key challenges in VLA development and real-world deployment, and outline promising directions for future research.