Current Visual Simultaneous Localization and Mapping (VSLAM) systems often struggle to create maps that are both semantically rich and easily interpretable. While incorporating semantic scene knowledge aids in building richer maps with contextual associations among mapped objects, representing this knowledge in structured formats, such as scene graphs, has not been widely addressed, which hampers map comprehension and limits scalability. This paper introduces vS-Graphs, a novel real-time VSLAM framework that integrates vision-based scene understanding with map reconstruction and a comprehensible graph-based representation. The framework infers structural elements (i.e., rooms and floors) from detected building components (i.e., walls and ground surfaces) and incorporates them into optimizable 3D scene graphs. This solution enhances the reconstructed map's semantic richness, comprehensibility, and localization accuracy. Extensive experiments on standard benchmarks and real-world datasets demonstrate that vS-Graphs achieves an average accuracy gain of 15.22% across all tested datasets compared with state-of-the-art VSLAM methods. Furthermore, the proposed framework achieves environment-driven semantic entity detection accuracy comparable to that of precise LiDAR-based frameworks, using only visual features. The code is publicly available at https://github.com/snt-arg/visual_sgraphs and is actively being improved. Moreover, a web page containing more media and evaluation outcomes is available at https://snt-arg.github.io/vsgraphs-results/.
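To make the layered representation described above concrete, the sketch below models a hierarchical 3D scene graph in which keyframes observe building components (walls) and structural elements (rooms, floors) are inferred from them. This is a minimal illustrative sketch only: all type names and fields here are hypothetical assumptions for exposition and do not mirror the actual vS-Graphs data structures or API.

```cpp
// Minimal sketch of a layered 3D scene graph: keyframes -> walls -> rooms -> floors.
// All names and fields are illustrative, not the real vS-Graphs types.
#include <iostream>
#include <vector>

// Translation-only pose for brevity; a real SLAM graph would store a full
// SE(3) transform plus covariance so the layers can be jointly optimized.
struct Pose { double x = 0, y = 0, z = 0; };

struct KeyFrame { int id; Pose pose; };

// Building component detected from visual features: a wall as a plane
// (unit normal n and offset d), with the keyframes that observed it.
struct Wall {
    int id;
    double nx, ny, nz, d;
    std::vector<int> observingKeyFrames;
};

// Structural elements inferred from building components.
struct Room  { int id; std::vector<int> wallIds; };
struct Floor { int id; std::vector<int> roomIds; };

// Each layer references the layer below it, so constraints attached to a
// room or floor node can propagate down to walls and keyframe poses.
struct SceneGraph {
    std::vector<KeyFrame> keyframes;
    std::vector<Wall> walls;
    std::vector<Room> rooms;
    std::vector<Floor> floors;
};

int main() {
    SceneGraph g;
    g.keyframes.push_back({0, {0.0, 0.0, 0.0}});
    g.keyframes.push_back({1, {1.0, 0.0, 0.0}});
    g.walls.push_back({0, 1.0, 0.0, 0.0, -2.0, {0, 1}});  // seen from both keyframes
    g.walls.push_back({1, 0.0, 1.0, 0.0, -3.0, {1}});
    g.rooms.push_back({0, {0, 1}});  // room bounded by the two walls
    g.floors.push_back({0, {0}});    // floor containing that room

    std::cout << "Layers: " << g.keyframes.size() << " keyframes, "
              << g.walls.size() << " walls, " << g.rooms.size()
              << " rooms, " << g.floors.size() << " floor(s)\n";
    return 0;
}
```

The design point this sketch captures is that higher-level nodes (rooms, floors) are derived entities: they hold no sensor data themselves but link lower-level nodes together, which is what makes the resulting scene graph both human-readable and usable as an optimization backbone.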