Despite remarkable progress, multimodal foundation models still exhibit surprising deficiencies in spatial intelligence. In this work, we explore scaling up multimodal foundation models to cultivate spatial intelligence, yielding the SenseNova-SI family, built upon established multimodal foundations: visual understanding models (i.e., Qwen3-VL and InternVL3) and unified understanding and generation models (i.e., Bagel). We take a principled approach to building high-performing and robust spatial intelligence by systematically curating SenseNova-SI-8M: eight million diverse data samples organized under a rigorous taxonomy of spatial capabilities. SenseNova-SI demonstrates unprecedented performance across a broad range of spatial intelligence benchmarks: 68.7% on VSI-Bench, 43.3% on MMSI, 85.6% on MindCube, 54.6% on ViewSpatial, and 50.1% on SITE, while maintaining strong general multimodal understanding (e.g., 84.9% on MMBench-En). More importantly, we analyze the impact of data scaling, discuss early signs of emergent generalization enabled by training on diverse data, examine the risks of overfitting and language shortcuts, present a preliminary study of spatial chain-of-thought reasoning, and validate a potential downstream application. SenseNova-SI is an ongoing project, and this report will be updated continuously. All newly trained multimodal foundation models are publicly released to facilitate further research in this direction.