Vision-language models (VLMs) hold promise for enhancing visualization tools, but effective human-AI collaboration hinges on a shared perceptual understanding of visual content. Prior studies assessed VLM visualization literacy through interpretive tasks, revealing an over-reliance on textual cues rather than genuine visual analysis. Our study investigates a more foundational skill underpinning such literacy: the ability of VLMs to recognize a chart's core visual properties as humans do. We task 13 diverse VLMs with classifying scientific visualizations based solely on visual stimuli, according to three criteria: purpose (e.g., schematic, GUI, visualization), encoding (e.g., bar, point, node-link), and dimensionality (e.g., 2D, 3D). Using expert labels from the human-centric VisType typology as ground truth, we find that VLMs often identify purpose and dimensionality accurately but struggle with specific encoding types. Our preliminary results show that larger models do not always deliver superior performance, and they highlight the need for careful integration of VLMs into visualization tasks, with human supervision to ensure reliable outcomes.