Unlocking spatial reasoning in Multimodal Large Language Models (MLLMs) is crucial for enabling intelligent interaction with 3D environments. While prior efforts often rely on explicit 3D inputs or specialized model architectures, we ask: can MLLMs reason about 3D space using only structured 2D representations derived from perception? We introduce Struct2D, a perception-guided prompting framework that combines bird's-eye-view (BEV) images with object marks and object-centric metadata, optionally incorporating egocentric keyframes when needed. Using Struct2D, we conduct an in-depth zero-shot analysis of closed-source MLLMs (e.g., GPT-o3) and find that they exhibit surprisingly strong spatial reasoning abilities when provided with structured 2D inputs, effectively handling tasks such as relative direction estimation and route planning. Building on these insights, we construct Struct2D-Set, a large-scale instruction tuning dataset with 200K fine-grained QA pairs across eight spatial reasoning categories, generated automatically from 3D indoor scenes. We fine-tune an open-source MLLM (Qwen2.5VL) on Struct2D-Set, achieving competitive performance on multiple benchmarks, including 3D question answering, dense captioning, and object grounding. Our approach demonstrates that structured 2D inputs can effectively bridge perception and language reasoning in MLLMs, without requiring explicit 3D representations as input. We will release both our code and dataset to support future research.
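To make the prompting format concrete, the minimal sketch below assembles a hypothetical Struct2D-style payload in Python: a marked BEV image, object-centric metadata keyed to the numeric marks, and optional egocentric keyframes, serialized next to the question. The field names (`mark_id`, `center_xy`) and the payload layout are illustrative assumptions rather than the paper's exact schema, and the result would still need to be adapted to a specific MLLM's message API.

```python
# Hypothetical sketch of a Struct2D-style prompt payload (not the paper's exact format):
# a BEV image with numeric object marks, object-centric metadata, and optional keyframes.
import json
from dataclasses import dataclass, asdict
from typing import List, Optional


@dataclass
class ObjectMark:
    mark_id: int            # numeric mark drawn on the BEV image (illustrative)
    category: str           # object category, e.g., "sofa", "table"
    center_xy: List[float]  # object centroid in BEV/world coordinates (meters)


def build_struct2d_prompt(question: str,
                          bev_image_path: str,
                          objects: List[ObjectMark],
                          keyframe_paths: Optional[List[str]] = None) -> dict:
    """Bundle structured 2D inputs into a single prompt payload.

    The returned dict is a neutral container; mapping it onto a concrete
    MLLM API (interleaved image/text messages, etc.) is left to the caller.
    """
    metadata = [asdict(o) for o in objects]
    text = (
        "You are given a bird's-eye-view (BEV) map of an indoor scene. "
        "Objects are labeled with numeric marks; their metadata follows.\n"
        f"Object metadata: {json.dumps(metadata)}\n"
        f"Question: {question}"
    )
    return {
        "images": [bev_image_path] + (keyframe_paths or []),
        "text": text,
    }


if __name__ == "__main__":
    objs = [
        ObjectMark(1, "sofa", [2.1, 3.4]),
        ObjectMark(2, "door", [0.2, 5.0]),
        ObjectMark(3, "table", [3.0, 1.2]),
    ]
    prompt = build_struct2d_prompt(
        "Standing at the sofa and facing the door, is the table on your left or right?",
        "scene0000_bev.png", objs, keyframe_paths=["frame_012.jpg"])
    print(json.dumps(prompt, indent=2))
```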