MessIRve: A Large-Scale Spanish Information Retrieval Dataset

Information retrieval (IR) is the task of finding relevant documents in response to a user query. Although Spanish is the second most spoken native language, there are few Spanish IR datasets, which limits the development of information access tools for Spanish speakers. We introduce MessIRve, a large-scale Spanish IR dataset with almost 700,000 queries from Google's autocomplete API and relevant documents sourced from Wikipedia. MessIRve's queries reflect diverse Spanish-speaking regions, unlike other datasets that are translated from English or do not consider dialectal variations. The large size of the dataset allows it to cover a wide variety of topics, unlike smaller datasets. We provide a comprehensive description of the dataset, comparisons with existing datasets, and baseline evaluations of prominent IR models. Our contributions aim to advance Spanish IR research and improve information access for Spanish speakers.
