Recent advances in Multimodal Large Language Models (MLLMs) have significantly enhanced their capabilities; however, spatial perception remains a notable limitation. Multimodal data synthesis offers a promising remedy, yet ensuring that synthesized data adhere to spatial common sense is non-trivial. In this work, we introduce SKG2DATA, a novel multimodal synthesis approach guided by spatial knowledge graphs and grounded in the concept of knowledge-to-data generation, which provides a systematic framework for generating spatially coherent data. SKG2DATA employs an automated pipeline to construct a Spatial Knowledge Graph (SKG) that captures human-like spatial cognition, including directional and distance relationships. These structured representations then guide an integrated synthesis pipeline in which a diffusion model generates spatially consistent images while an MLLM produces the corresponding textual descriptions. Automated SKG construction enables scalable generation of diverse yet realistic spatial configurations, overcoming the limitations of manual data collection and annotation. Extensive experiments demonstrate that data synthesized from diverse types of spatial knowledge, including direction and distance, markedly enhance the spatial perception and reasoning abilities of MLLMs, albeit at a slight cost to their general capabilities. We hope that the idea of knowledge-based data synthesis can advance the development of spatial intelligence. Code is available at https://github.com/zjunlp/Knowledge2Data.
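To make the idea of a spatial knowledge graph concrete, the following is a minimal sketch of how directional and distance relations between objects might be stored as triples, checked for basic spatial consistency, and flattened into a text prompt for a downstream generator. It is not the SKG2DATA implementation: the class name `SpatialKG`, the relation vocabularies, the consistency rule, and the prompt format are all assumptions made for illustration.

```python
from dataclasses import dataclass, field

# Hypothetical relation vocabularies (assumed for illustration; the
# abstract only states that direction and distance relations are covered).
DIRECTIONS = {"left of", "right of", "above", "below", "in front of", "behind"}
DISTANCES = {"near", "far from"}
INVERSE = {
    "left of": "right of", "right of": "left of",
    "above": "below", "below": "above",
    "in front of": "behind", "behind": "in front of",
}


@dataclass
class SpatialKG:
    """Toy spatial knowledge graph: objects as nodes, relations as triples."""
    objects: set = field(default_factory=set)
    triples: list = field(default_factory=list)

    def add(self, subject: str, relation: str, obj: str) -> None:
        """Add a (subject, relation, object) triple with a known relation."""
        if relation not in DIRECTIONS | DISTANCES:
            raise ValueError(f"unknown relation: {relation!r}")
        self.objects.update((subject, obj))
        self.triples.append((subject, relation, obj))

    def is_consistent(self) -> bool:
        """Return False if two directional triples contradict each other.

        Two cases are checked: "A r B" together with "B r A", and
        "A r B" together with "A inverse(r) B".
        """
        stated = set(self.triples)
        for s, r, o in self.triples:
            if r not in DIRECTIONS:
                continue
            if (o, r, s) in stated or (s, INVERSE[r], o) in stated:
                return False
        return True

    def to_prompt(self) -> str:
        """Flatten the triples into a plain-text description."""
        return "; ".join(f"{s} {r} {o}" for s, r, o in self.triples)


if __name__ == "__main__":
    kg = SpatialKG()
    kg.add("cup", "left of", "laptop")
    kg.add("cup", "near", "laptop")
    print(kg.is_consistent())
    print(kg.to_prompt())
```

In a pipeline of the kind the abstract describes, a graph that passes such a check could serve as the shared guidance for both image generation and text generation, so that the two modalities describe the same spatial configuration. The check here covers only pairwise directional contradictions; a fuller treatment would also need transitive and distance constraints.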