Large vision-language models (LVLMs) have significantly advanced numerous fields. In this work, we explore how to harness their potential to address 3D scene understanding tasks, using 3D question answering (3D-QA) as a representative example. Because training data in 3D is limited, we do not train LVLMs but instead run inference in a zero-shot manner. Specifically, we sample 2D views from a 3D point cloud and feed them into 2D models to answer a given question. Once the 2D model is chosen, e.g., LLaVA-OV, the quality of the sampled views matters most. We propose cdViews, a novel approach to automatically selecting critical and diverse Views for 3D-QA. cdViews consists of two key components: viewSelector, which prioritizes critical views based on their potential to provide answer-specific information, and viewNMS, which enhances diversity by removing redundant views based on spatial overlap. We evaluate cdViews on the widely used ScanQA and SQA benchmarks, demonstrating that it achieves state-of-the-art performance in 3D-QA while relying solely on 2D models without fine-tuning. These findings support our belief that 2D LVLMs are currently the most effective alternative to resource-intensive 3D LVLMs for addressing 3D tasks.
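To make the selection step concrete, the sketch below illustrates the kind of greedy, overlap-based suppression the abstract attributes to viewNMS: candidate views are ranked by a viewSelector score and a view is kept only if it does not overlap too much with any already-selected view. The scoring function, the pairwise-overlap measure, the threshold, and the number of retained views are illustrative assumptions, not the paper's actual implementation.

```python
# A minimal sketch of the cdViews selection idea described in the abstract.
# All numeric choices and the overlap measure are assumptions for illustration.
import numpy as np

def view_nms(scores, overlaps, overlap_thresh=0.5, top_k=4):
    """Greedy non-maximum suppression over candidate 2D views.

    scores:   (N,) relevance scores from the viewSelector (higher = more critical).
    overlaps: (N, N) pairwise spatial-overlap matrix, e.g., the fraction of
              scene points visible in both views (assumed measure).
    Returns indices of the selected views, ordered by decreasing score.
    """
    order = np.argsort(scores)[::-1]  # most critical views first
    keep = []
    for idx in order:
        # Keep a view only if it is sufficiently distinct from all kept views.
        if all(overlaps[idx, k] < overlap_thresh for k in keep):
            keep.append(idx)
        if len(keep) == top_k:
            break
    return keep

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n_views = 10
    scores = rng.random(n_views)                # stand-in viewSelector scores
    overlaps = rng.random((n_views, n_views))   # stand-in pairwise overlaps
    overlaps = (overlaps + overlaps.T) / 2      # symmetrize
    np.fill_diagonal(overlaps, 1.0)
    selected = view_nms(scores, overlaps)
    print("Selected view indices:", selected)   # these frames would be fed to the 2D LVLM
```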