The proliferation of datasets across open data portals and enterprise data lakes presents an opportunity for deriving data-driven insights. Widely used dataset search systems rely on keyword search over dataset metadata, including descriptions, to support discovery. Consequently, when these descriptions are incomplete, missing, or inconsistent with the dataset contents, findability is severely compromised. To improve findability, we introduce AutoDDG, a framework that automatically generates descriptions of tabular data. AutoDDG takes a data-driven approach to summarizing dataset contents and uses large language models (LLMs) to enrich these summaries with semantic information and produce human-readable text, yielding descriptions that are comprehensive, accurate, readable, and concise. A critical challenge is evaluating description generation methods and assessing the quality of the descriptions they produce. We propose a comprehensive evaluation methodology that combines retrieval-based, reference-based, and reference-free assessment, complemented by human validation. Our experimental results on new benchmarks demonstrate that AutoDDG generates high-quality, accurate descriptions at scale, significantly improving dataset retrieval performance across diverse use cases. AutoDDG is publicly available at https://github.com/VIDA-NYU/AutoDDG.
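To make the two-stage idea in the abstract concrete (first summarize the table from its data, then hand that summary to an LLM for enrichment), the following is a minimal sketch and not AutoDDG's actual implementation. The function names `profile_column`, `profile_table`, and `build_prompt`, the choice of statistics, and the prompt wording are all assumptions made for illustration; the LLM call itself is deliberately left out.

```python
from collections import Counter


def profile_column(values):
    """Summarize one column: missing count, distinct count, and basic stats."""
    present = [v for v in values if v not in (None, "")]
    summary = {
        "missing": len(values) - len(present),
        "distinct": len(set(present)),
    }
    try:
        numbers = [float(v) for v in present]
    except (TypeError, ValueError):
        numbers = None  # at least one value is not numeric
    if numbers:
        summary.update(type="numeric", min=min(numbers), max=max(numbers))
    else:
        top = [value for value, _ in Counter(present).most_common(3)]
        summary.update(type="text", top_values=top)
    return summary


def profile_table(rows):
    """Build a per-column profile from a list of row dictionaries."""
    columns = list(rows[0].keys()) if rows else []
    return {c: profile_column([r.get(c) for r in rows]) for c in columns}


def build_prompt(title, profile):
    """Render the profile as text that could be sent to an LLM (hypothetical prompt)."""
    lines = [f"Dataset title: {title}", "Column profile:"]
    for name, stats in profile.items():
        details = ", ".join(f"{k}={v}" for k, v in stats.items())
        lines.append(f"- {name}: {details}")
    lines.append("Write a concise, accurate description of this dataset.")
    return "\n".join(lines)


# Toy table, invented for this example only.
rows = [
    {"borough": "Brooklyn", "trees": "12"},
    {"borough": "Queens", "trees": "7"},
    {"borough": "Brooklyn", "trees": ""},
]
profile = profile_table(rows)
prompt = build_prompt("Street tree counts", profile)
print(prompt)
```

Because the prompt is built from statistics computed on the data rather than from the raw table, its size depends on the number of columns, not the number of rows, which is one plausible reason a profile-first design can scale to large datasets.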