The frequent need for analysts to create visualizations to derive insights from data has driven extensive research into natural language to visualization (NL2VIS) generation. While recent progress in large language models (LLMs) suggests their potential to effectively support NL2VIS tasks, existing studies lack a systematic investigation into the performance of different LLMs under various prompt strategies. This paper addresses this gap and contributes a crucial baseline evaluation of LLMs' capabilities in generating visualization specifications for NL2VIS tasks. Our evaluation uses the nvBench dataset and employs six representative LLMs with eight distinct prompt strategies, measuring their performance in generating six target chart types expressed in the Vega-Lite visualization specification. We assess model performance with multiple metrics, including visualization accuracy, validity, and legality. Our results reveal substantial performance disparities across prompt strategies, chart types, and LLMs. Furthermore, based on the evaluation results, we uncover several counterintuitive behaviors across these dimensions and propose directions for enhancing the NL2VIS benchmark to better support future NL2VIS research.