Comparative Analysis of 47 Context-Based Question Answer Models Across 8 Diverse Datasets

Context-based question answering (CBQA) models provide more accurate and relevant answers by taking contextual information into account. They extract specific information from a given context, which makes them useful in applications such as user support, information retrieval, and educational platforms. In this manuscript, we benchmark 47 CBQA models from Hugging Face on eight different datasets. The aim is to identify the model that performs best across diverse datasets without additional fine-tuning. This matters for practical applications because it minimizes the need to retrain models for specific datasets and so streamlines their deployment in various contexts. The best-performing models were trained on the SQuAD v2 or SQuAD v1 datasets. The best overall was ahotrod/electra_large_discriminator_squad2_512, which reached 43% accuracy across all datasets. The computation time of every model depends on the context length and the model size. Model performance usually decreases as the answer length increases, and it also depends on the complexity of the context. We also used a genetic algorithm to improve overall accuracy by integrating responses from other models. ahotrod/electra_large_discriminator_squad2_512 produced the best results on bioasq10b-factoid (65.92%), biomedical_cpgQA (96.45%), QuAC (11.13%), and Question Answer Dataset (41.6%). The model bert-large-uncased-whole-word-masking-finetuned-squad achieved an accuracy of 82% on the IELTS dataset.
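The benchmarking setup described in the abstract (several models, several datasets, no fine-tuning, one accuracy figure per pair) can be sketched roughly as below. This is a minimal illustration, not the authors' evaluation code. The abstract does not say how accuracy was computed, so the sketch assumes a normalized exact-match comparison in the style of the SQuAD evaluation script. The names `normalize`, `exact_match_accuracy`, and `benchmark`, and the in-memory `(question, context, reference)` format, are hypothetical.

```python
import re
import string
from typing import Callable, Dict, List, Tuple

# Hypothetical in-memory format: (question, context, reference answer).
Example = Tuple[str, str, str]
# Any callable that maps (question, context) to a predicted answer string.
AnswerFn = Callable[[str, str], str]


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match_accuracy(answer_fn: AnswerFn, examples: List[Example]) -> float:
    """Fraction of examples whose normalized prediction equals the reference.

    Assumption: accuracy means exact match; the abstract does not define it.
    """
    if not examples:
        return 0.0
    hits = sum(
        normalize(answer_fn(question, context)) == normalize(reference)
        for question, context, reference in examples
    )
    return hits / len(examples)


def benchmark(
    models: Dict[str, AnswerFn],
    datasets: Dict[str, List[Example]],
) -> Dict[str, Dict[str, float]]:
    """Score every model on every dataset without any fine-tuning."""
    return {
        model_name: {
            dataset_name: exact_match_accuracy(answer_fn, examples)
            for dataset_name, examples in datasets.items()
        }
        for model_name, answer_fn in models.items()
    }


# A Hugging Face model could be plugged in as an AnswerFn like this
# (left commented out because it downloads model weights):
#
#   from transformers import pipeline
#   qa = pipeline("question-answering",
#                 model="ahotrod/electra_large_discriminator_squad2_512")
#   answer_fn = lambda question, context: qa(question=question,
#                                            context=context)["answer"]
```

Taking the model as a plain callable keeps the scoring logic independent of any particular library, so the same loop could compare pipeline-backed models against simple baselines or against a combined answer selected from several models.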

