AutoDDG: Automated Dataset Description Generation using Large Language Models

The proliferation of datasets across open data portals and enterprise data lakes presents an opportunity for deriving data-driven insights. Widely used dataset search systems rely on keyword search over dataset metadata, including descriptions, to support discovery. Consequently, when these descriptions are incomplete, missing, or inconsistent with dataset contents, findability is severely compromised. To improve findability, we introduce AutoDDG, a framework that automatically generates descriptions of tabular data. By adopting a data-driven approach to summarize dataset contents and leveraging large language models (LLMs) to enrich summaries with semantic information and produce human-readable text, AutoDDG derives descriptions that are comprehensive, accurate, readable, and concise. A critical challenge in this problem is evaluating the effectiveness of description generation methods and assessing the quality of the generated descriptions. We propose a comprehensive evaluation methodology that combines retrieval, reference-based, and reference-free assessment, with human validation. Our experimental results using new benchmarks demonstrate that AutoDDG generates high-quality, accurate descriptions at scale, significantly improving dataset retrieval performance across diverse use cases. AutoDDG is publicly available at https://github.com/VIDA-NYU/AutoDDG.
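To make the "summarize, then describe" idea concrete, here is a minimal sketch of a data-driven profiling step followed by prompt assembly. It is not AutoDDG's actual code or API: the function names `profile_table` and `build_prompt`, the choice of statistics, and the prompt wording are all assumptions made for illustration, and the LLM call itself is left out.

```python
import csv
import io
from collections import Counter

# Hypothetical helpers for illustration only; not part of the AutoDDG package.


def profile_table(csv_text, max_examples=3):
    """Summarize each column: coarse type, missing count, distinct count, examples."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    columns = []
    for name in reader.fieldnames or []:
        # Treat empty strings and absent cells as missing.
        values = [r[name] for r in rows if r[name] not in ("", None)]
        numeric = []
        for v in values:
            try:
                numeric.append(float(v))
            except ValueError:
                break
        is_numeric = bool(values) and len(numeric) == len(values)
        col = {
            "name": name,
            "type": "numeric" if is_numeric else "text",
            "missing": len(rows) - len(values),
            "distinct": len(set(values)),
        }
        if is_numeric:
            col["min"], col["max"] = min(numeric), max(numeric)
        else:
            col["examples"] = [v for v, _ in Counter(values).most_common(max_examples)]
        columns.append(col)
    return {"rows": len(rows), "columns": columns}


def build_prompt(profile):
    """Turn the profile into text that could be sent to an LLM (assumed wording)."""
    lines = [f"The table has {profile['rows']} rows and {len(profile['columns'])} columns."]
    for col in profile["columns"]:
        if col["type"] == "numeric":
            detail = f"range {col['min']} to {col['max']}"
        else:
            detail = "common values: " + ", ".join(col["examples"])
        lines.append(
            f"- {col['name']} ({col['type']}, {col['missing']} missing, "
            f"{col['distinct']} distinct): {detail}"
        )
    lines.append("Write a concise, accurate description of this dataset for a search index.")
    return "\n".join(lines)


# Made-up sample data, used only to exercise the sketch.
SAMPLE = "city,population\nParis,2100000\nLyon,\nParis,2100000\n"
print(build_prompt(profile_table(SAMPLE)))
```

Separating the profile from the prompt keeps the statistical summary deterministic and inspectable, so any language model would only be asked to add semantics and phrasing on top of facts computed from the data.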


