The success of foundation models in language and vision has motivated research into fully end-to-end robot navigation foundation models (NFMs). NFMs map monocular visual input directly to control actions, bypassing mid-level vision modules (tracking, depth estimation, etc.) entirely. While the assumption that vision capabilities will emerge implicitly is compelling, it requires large amounts of pixel-to-action supervision that are difficult to obtain. The challenge is especially pronounced in dynamic and unstructured settings, where robust navigation requires precise geometric and dynamic understanding, and where the depth-scale ambiguity of monocular views further limits accurate spatial reasoning. In this paper, we show that relying on monocular vision alone and discarding mid-level vision priors is inefficient. We present StereoWalker, which augments NFMs with stereo inputs and explicit mid-level vision such as depth estimation and dense pixel tracking. Our intuition is straightforward: stereo inputs resolve the depth-scale ambiguity, and modern mid-level vision models provide reliable geometric and motion structure in dynamic scenes. We also curate a large stereo navigation dataset with automatic action annotations from Internet stereo videos to support the training of StereoWalker and to facilitate future research. In our experiments, we find that mid-level vision enables StereoWalker to match state-of-the-art performance using only 1.5% of the training data, and to surpass the state of the art when using the full data. We also observe that stereo input yields higher navigation performance than monocular input.
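To make the scale argument concrete, the following is a minimal sketch (not from the paper) of the standard pinhole stereo relation depth = f · B / d: because the baseline B is known from rig calibration, disparity maps directly to metric depth, whereas a monocular depth estimate is only defined up to an unknown scale. The function name and parameters here are illustrative, not part of StereoWalker.

```python
import numpy as np

def disparity_to_depth(disparity, focal_px, baseline_m):
    """Convert a stereo disparity map (pixels) to metric depth (meters).

    Pinhole stereo relation: depth = f * B / d. The known baseline B
    fixes absolute scale, which monocular estimation cannot recover.
    """
    d = np.asarray(disparity, dtype=np.float64)
    depth = np.full_like(d, np.inf)     # non-positive disparity -> no finite depth
    valid = d > 0
    depth[valid] = focal_px * baseline_m / d[valid]
    return depth

# Example: a 10 px disparity with f = 500 px and B = 0.12 m
# gives a metric depth of 500 * 0.12 / 10 = 6.0 m.
print(disparity_to_depth(np.array([10.0]), focal_px=500.0, baseline_m=0.12))
```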