Text-driven 3D scene generation holds promise for a wide range of applications, from virtual prototyping to AR/VR and simulation. However, existing methods are often constrained to single-object generation, require domain-specific training, or lack support for full 360-degree viewability. In this work, we present a training-free approach to 3D scene synthesis by repurposing general-purpose text-to-3D object diffusion models as modular tile generators. We reformulate scene generation as a multi-tile denoising problem, where overlapping 3D regions are independently generated and seamlessly blended via weighted averaging. This enables scalable synthesis of large, coherent scenes while preserving local semantic control. Our method eliminates the need for scene-level datasets or retraining, relies on minimal heuristics, and inherits the generalization capabilities of object-level priors. We demonstrate that our approach supports diverse scene layouts, efficient generation, and flexible editing, establishing a simple yet powerful foundation for general-purpose, language-driven 3D scene construction.
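To make the multi-tile idea concrete, below is a minimal sketch (not the authors' code) of the overlap-and-blend step described above: each 3D tile is denoised independently by an object-level diffusion prior, and voxels covered by several tiles are merged via weighted averaging. The helper names (`blend_weight`, `blend_tiles`), the tile layout, the linear border taper, and the single-channel latent grid are illustrative assumptions, not details taken from the paper; in practice the blend could be applied at each denoising step or to the final latents.

```python
import numpy as np

def blend_weight(tile_shape):
    """Assumed per-voxel weight that tapers linearly toward tile borders,
    so overlapping tiles transition smoothly instead of seaming."""
    axes = [np.minimum(np.arange(n) + 1, n - np.arange(n)) / (n / 2) for n in tile_shape]
    w = np.minimum.reduce(np.meshgrid(*axes, indexing="ij"))
    return np.clip(w, 1e-3, 1.0)

def blend_tiles(scene_shape, tiles):
    """tiles: list of (offset, latent) pairs, where `latent` is the independently
    denoised 3D latent of one tile and `offset` is its corner in the scene grid.
    Overlapping voxels are combined by weighted averaging."""
    num = np.zeros(scene_shape, dtype=np.float32)
    den = np.zeros(scene_shape, dtype=np.float32)
    for (z, y, x), latent in tiles:
        d, h, w = latent.shape
        wgt = blend_weight((d, h, w))
        num[z:z + d, y:y + h, x:x + w] += wgt * latent
        den[z:z + d, y:y + h, x:x + w] += wgt
    return num / np.maximum(den, 1e-8)  # weighted average over covering tiles

# Toy usage: two overlapping 8^3 tiles covering an 8x8x12 scene grid.
rng = np.random.default_rng(0)
tile_a = ((0, 0, 0), rng.standard_normal((8, 8, 8)).astype(np.float32))
tile_b = ((0, 0, 4), rng.standard_normal((8, 8, 8)).astype(np.float32))
scene = blend_tiles((8, 8, 12), [tile_a, tile_b])
print(scene.shape)  # (8, 8, 12)
```

Because the accumulation is a simple weighted sum, the scene can be grown tile by tile without retraining, which is what allows the approach to scale to large layouts while keeping each tile under local text control.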