Vision language models (VLMs) can reason jointly about images and text to tackle many tasks, from visual question answering to image captioning. This paper focuses on map parsing, a novel task that is unexplored in the VLM context and particularly useful for mobile robots. Map parsing requires understanding not only the labels on a map but also its geometric configuration, i.e., what the areas are like and how they are connected. To evaluate the performance of VLMs on map parsing, we prompt VLMs with floor plan maps to generate task plans for complex indoor navigation. Our results demonstrate the remarkable capability of VLMs in map parsing, with a success rate of 0.96 on tasks requiring a sequence of nine navigation actions, e.g., approaching and going through doors. Beyond intuitive observations, e.g., that VLMs do better on smaller maps and in simpler navigation tasks, we found an interesting phenomenon: their performance drops in large open areas. We provide practical suggestions to address such challenges, validated by our experimental results. Webpage: https://sites.google.com/view/vlm-floorplan/
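The evaluation setup described above, prompting a VLM with a floor plan image to obtain a step-by-step navigation plan, can be sketched as follows. This is a minimal illustration assuming an OpenAI-compatible chat API; the model name, prompt wording, action vocabulary, and output format are assumptions for illustration, not the exact configuration used in the paper.

```python
# Minimal sketch: prompt a VLM with a floor plan image to produce a navigation plan.
# Assumptions: an OpenAI-compatible chat API, model "gpt-4o", and a simple
# newline-separated action format; none of these are claimed to be the paper's setup.
import base64
from openai import OpenAI

client = OpenAI()

def plan_from_floor_plan(image_path: str, start: str, goal: str) -> list[str]:
    """Ask the VLM to parse the floor plan and return a list of navigation actions."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")

    prompt = (
        f"The image is a floor plan. Starting in {start}, produce a step-by-step "
        f"plan to reach {goal}. Use actions such as 'approach door D3', "
        f"'go through door D3', 'enter room R2'. Return one action per line."
    )
    response = client.chat.completions.create(
        model="gpt-4o",  # hypothetical choice of VLM
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
    )
    # Each non-empty line of the reply is treated as one navigation action.
    return [line.strip()
            for line in response.choices[0].message.content.splitlines()
            if line.strip()]

# Example usage (hypothetical map and room labels):
# actions = plan_from_floor_plan("floor_plan.png", start="room R1", goal="room R7")
# print("\n".join(actions))
```

The returned action sequence can then be compared against a ground-truth plan (e.g., the nine-step tasks mentioned above) to compute a success rate.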