We argue that progress in true multimodal intelligence calls for a shift from reactive, task-driven systems and brute-force long context towards a broader paradigm of supersensing. We frame spatial supersensing as four stages beyond linguistic-only understanding: semantic perception (naming what is seen), streaming event cognition (maintaining memory across continuous experiences), implicit 3D spatial cognition (inferring the world behind pixels), and predictive world modeling (creating internal models that filter and organize information). Current benchmarks largely test only the early stages, offering narrow coverage of spatial cognition and rarely challenging models in ways that require true world modeling. To drive progress in spatial supersensing, we present VSI-SUPER, a two-part benchmark: VSR (long-horizon visual spatial recall) and VSC (continual visual spatial counting). These tasks require arbitrarily long video inputs yet are resistant to brute-force context expansion. We then test data scaling limits by curating VSI-590K and training Cambrian-S, achieving +30% absolute improvement on VSI-Bench without sacrificing general capabilities. Yet performance on VSI-SUPER remains limited, indicating that scale alone is insufficient for spatial supersensing. We propose predictive sensing as a path forward, presenting a proof-of-concept in which a self-supervised next-latent-frame predictor leverages surprise (prediction error) to drive memory and event segmentation. On VSI-SUPER, this approach substantially outperforms leading proprietary baselines, showing that spatial supersensing requires models that not only see but also anticipate, select, and organize experience.
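The surprise-driven mechanism described above can be illustrated with a minimal sketch. This is not the paper's implementation: the learned next-latent-frame predictor is replaced with a trivial identity baseline (predict the previous latent frame), and the function name, threshold, and mean-pooling consolidation are all illustrative assumptions. It only shows the control flow the abstract describes: high prediction error ("surprise") triggers an event boundary and consolidates the finished segment into memory, while low-surprise frames are absorbed into the current segment.

```python
import numpy as np

def surprise_driven_segmentation(latents, threshold=1.0):
    """Toy surprise-based event segmentation over a latent-frame stream.

    Stand-in predictor: predict each frame as the previous frame.
    A frame whose prediction error exceeds `threshold` starts a new
    event; the finished segment is consolidated (here, mean-pooled)
    into long-term memory.
    """
    memory = []               # consolidated event summaries
    boundaries = []           # frame indices where a new event begins
    segment = [latents[0]]    # frames of the current (open) event
    for t in range(1, len(latents)):
        pred = latents[t - 1]                         # identity predictor
        surprise = np.linalg.norm(latents[t] - pred)  # prediction error
        if surprise > threshold:
            # Unexpected frame: close the current event and open a new one.
            memory.append(np.mean(segment, axis=0))
            boundaries.append(t)
            segment = [latents[t]]
        else:
            # Expected frame: accumulate into the current event.
            segment.append(latents[t])
    memory.append(np.mean(segment, axis=0))           # flush final segment
    return boundaries, memory

# Usage: two constant "scenes" with an abrupt change at frame 5.
latents = np.concatenate([np.zeros((5, 4)), np.full((5, 4), 3.0)])
boundaries, memory = surprise_driven_segmentation(latents, threshold=1.0)
print(boundaries)      # -> [5]
print(len(memory))     # -> 2
```

Because memory grows with the number of detected events rather than the number of frames, this style of consolidation is what lets the input stream be arbitrarily long while keeping the stored state bounded.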