Term weighting schemes are widely used in Natural Language Processing and Information Retrieval. In particular, term weighting is the basis for keyword extraction. However, relatively few evaluation studies shed light on the strengths and shortcomings of each weighting scheme. In fact, in most cases researchers and practitioners resort to the well-known tf-idf by default, despite the existence of other suitable alternatives, including graph-based models. In this paper, we perform an exhaustive and large-scale empirical comparison of both statistical and graph-based term weighting methods in the context of keyword extraction. Our analysis reveals some interesting findings, such as the advantages of the lesser-known lexical specificity over tf-idf and the qualitative differences between statistical and graph-based methods. Finally, based on our findings, we offer some suggestions for practitioners. We release our code at https://github.com/asahi417/kex.