Information retrieval (IR) is the task of finding relevant documents in response to a user query. Although Spanish is the second most spoken native language, there are few Spanish IR datasets, which limits the development of information access tools for Spanish speakers. We introduce MessIRve, a large-scale Spanish IR dataset with almost 700,000 queries from Google's autocomplete API and relevant documents sourced from Wikipedia. MessIRve's queries reflect diverse Spanish-speaking regions, unlike other datasets that are translated from English or do not consider dialectal variations. The large size of the dataset allows it to cover a wide variety of topics, unlike smaller datasets. We provide a comprehensive description of the dataset, comparisons with existing datasets, and baseline evaluations of prominent IR models. Our contributions aim to advance Spanish IR research and improve information access for Spanish speakers.