Spatial intelligence spans a rich suite of abilities, including visualising and transforming shapes, mentally rotating objects, judging relational positions and containment, and estimating numerosity. However, it remains a critical unsolved challenge for Multimodal Large Language Models (MLLMs). To address this gap, we propose treating Euclidean geometry problem-solving as a surrogate task. Specifically, we construct a curated multimodal dataset, Euclid30K, comprising approximately 30K plane and solid geometry problems. To enable models to learn and apply Euclidean principles from these problems, we fine-tune seven model variants (spanning 3B--72B parameters) from the Qwen2.5VL, Qwen3VL, and RoboBrain2.0 families using Group Relative Policy Optimization (GRPO), encouraging the models to identify shapes, count and relate entities, and perform multi-step deductive reasoning with Euclidean principles. Our experiments show that the resulting models achieve substantial zero-shot gains across four spatial reasoning benchmarks (Super-CLEVR, Omni3DBench, VSI-Bench, and MindCube) without any task-specific adaptation. Notably, after training on Euclid30K, mean VSI-Bench accuracy rose from 36.6\% to 41.8\% (+5.2 points), and mean MindCube accuracy rose from 31.4\% to 38.1\% (+6.7 points). To our knowledge, this is the first systematic study showing that geometry-centric fine-tuning can endow vision-language models with broadly transferable spatial skills. Code and the Euclid30K dataset are available at \href{https://zgca-ai4edu.github.io/Euclids_Gift}{this link}.