In embodied AI perception systems, visual perception should be active: the goal is not to passively process static images, but to actively acquire more informative data within pixel and spatial budget constraints. Existing vision models and fixed RGB-D camera systems fundamentally fail to reconcile wide-area coverage with fine-grained detail acquisition, severely limiting their efficacy in open-world robotic applications. To address this issue, we propose EyeVLA, a robotic eyeball for active visual perception that can take proactive actions based on instructions, enabling clear observation of fine-grained target objects and detailed information across a wide spatial range. EyeVLA discretizes action behaviors into action tokens and integrates them with vision-language models (VLMs) that possess strong open-world understanding capabilities, enabling joint modeling of vision, language, and actions within a single autoregressive sequence. By using 2D bounding box coordinates to guide the reasoning chain and applying reinforcement learning to refine the viewpoint selection policy, we transfer the open-world scene understanding capability of the VLM to a vision-language-action (VLA) policy using only minimal real-world data. Experiments show that our system efficiently carries out instructed observation tasks in real-world scenes and actively acquires more accurate visual information through instruction-driven rotation and zoom actions, thereby achieving strong environmental perception capabilities. EyeVLA introduces a novel robotic vision system that leverages detailed, spatially rich, large-scale embodied data and actively acquires highly informative visual observations for downstream embodied tasks.
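To make the action-token idea concrete, the following is a minimal sketch of how continuous rotation and zoom commands could be discretized into token strings appended to a VLM's autoregressive sequence and decoded back into camera commands. The bin count, action names, and value ranges are illustrative assumptions, not EyeVLA's actual action vocabulary.

```python
# Minimal sketch (not the authors' code): discretize continuous camera actions
# (pan, tilt, zoom) into action tokens for an autoregressive sequence, and
# decode them back. Bin count and ranges are assumptions for illustration.
import numpy as np

N_BINS = 256  # hypothetical number of discrete bins per action dimension

# Assumed physical range for each action dimension.
ACTION_RANGES = {
    "pan":  (-180.0, 180.0),   # degrees
    "tilt": (-90.0, 90.0),     # degrees
    "zoom": (1.0, 10.0),       # optical zoom factor
}

def action_to_tokens(action: dict) -> list[str]:
    """Map a continuous action to discrete action-token strings."""
    tokens = []
    for name, (lo, hi) in ACTION_RANGES.items():
        value = np.clip(action[name], lo, hi)
        bin_id = int(round((value - lo) / (hi - lo) * (N_BINS - 1)))
        tokens.append(f"<{name}_{bin_id}>")
    return tokens

def tokens_to_action(tokens: list[str]) -> dict:
    """Invert the mapping: decode action tokens back to continuous values."""
    action = {}
    for tok in tokens:
        name, bin_id = tok.strip("<>").rsplit("_", 1)
        lo, hi = ACTION_RANGES[name]
        action[name] = lo + int(bin_id) / (N_BINS - 1) * (hi - lo)
    return action

if __name__ == "__main__":
    cmd = {"pan": 37.5, "tilt": -12.0, "zoom": 4.2}
    toks = action_to_tokens(cmd)
    print(toks)                    # e.g. ['<pan_154>', '<tilt_110>', '<zoom_91>']
    print(tokens_to_action(toks))  # approximately recovers cmd (up to bin width)
```

In such a scheme, the discrete tokens can simply be added to the model's vocabulary, so predicting the next viewpoint reduces to ordinary next-token prediction over vision and language context.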


