This paper presents Omni-View, which extends unified multimodal understanding and generation to 3D scenes represented by multiview images, exploring the principle that "generation facilitates understanding". Consisting of an understanding model, a texture module, and a geometry module, Omni-View jointly models scene understanding, novel view synthesis, and geometry estimation, enabling synergistic interaction between 3D scene understanding and generation tasks. By design, it leverages the spatiotemporal modeling capabilities of the texture module, which is responsible for appearance synthesis, alongside the explicit geometric constraints provided by the dedicated geometry module, thereby enriching the model's holistic understanding of 3D scenes. Trained with a two-stage strategy, Omni-View achieves a state-of-the-art score of 55.4 on the VSI-Bench benchmark, outperforming existing specialized 3D understanding models, while also delivering strong performance in both novel view synthesis and 3D scene generation.
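To make the three-module composition concrete, the sketch below shows one plausible way shared scene features could feed understanding, appearance, and geometry heads. It is a minimal illustrative assumption, not the paper's implementation: all class names, interfaces, and tensor shapes (e.g. `OmniViewSketch`, per-token RGB and depth heads) are hypothetical.

```python
# Hypothetical sketch of a joint understanding/texture/geometry composition.
# Names, interfaces, and shapes are illustrative assumptions, not Omni-View's actual code.
import torch
import torch.nn as nn


class UnderstandingModel(nn.Module):
    """Stand-in for the multimodal understanding backbone (assumed interface)."""
    def __init__(self, dim: int = 256):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, view_tokens: torch.Tensor) -> torch.Tensor:
        # view_tokens: (batch, num_views * tokens_per_view, dim)
        return self.proj(view_tokens)


class TextureModule(nn.Module):
    """Stand-in for appearance / novel view synthesis (assumed interface)."""
    def __init__(self, dim: int = 256):
        super().__init__()
        self.decoder = nn.Linear(dim, 3)  # per-token RGB prediction

    def forward(self, scene_feats: torch.Tensor) -> torch.Tensor:
        return self.decoder(scene_feats)


class GeometryModule(nn.Module):
    """Stand-in for geometry estimation, e.g. per-token depth (assumed interface)."""
    def __init__(self, dim: int = 256):
        super().__init__()
        self.decoder = nn.Linear(dim, 1)  # per-token depth prediction

    def forward(self, scene_feats: torch.Tensor) -> torch.Tensor:
        return self.decoder(scene_feats)


class OmniViewSketch(nn.Module):
    """Shared scene features drive understanding, texture, and geometry jointly."""
    def __init__(self, dim: int = 256):
        super().__init__()
        self.understanding = UnderstandingModel(dim)
        self.texture = TextureModule(dim)
        self.geometry = GeometryModule(dim)

    def forward(self, view_tokens: torch.Tensor):
        scene_feats = self.understanding(view_tokens)
        rgb = self.texture(scene_feats)      # appearance for novel view synthesis
        depth = self.geometry(scene_feats)   # explicit geometric signal
        return scene_feats, rgb, depth


if __name__ == "__main__":
    model = OmniViewSketch()
    tokens = torch.randn(2, 8 * 64, 256)  # 2 scenes, 8 views, 64 tokens per view
    feats, rgb, depth = model(tokens)
    print(feats.shape, rgb.shape, depth.shape)
```

In such a layout, gradients from the texture and geometry heads flow back into the shared features, which is one way the "generation facilitates understanding" synergy described above could be realized in practice.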