Large-scale supervised data is essential for training modern ranking models, but obtaining high-quality human annotations is costly. Click data has been widely used as a low-cost alternative, and with recent advances in large language models (LLMs), LLM-based relevance annotation has emerged as another promising source of supervision. This paper investigates whether LLM annotations can replace click data for learning to rank (LTR) through a comprehensive comparison across multiple dimensions. Experiments on both a public dataset, TianGong-ST, and an industrial dataset, Baidu-Click, show that click-supervised models perform better on high-frequency queries, while LLM-annotation-supervised models are more effective on medium- and low-frequency queries. Further analysis shows that click-supervised models are better at capturing document-level signals such as authority or quality, whereas LLM-annotation-supervised models are more effective at modeling semantic matching between queries and documents and at distinguishing relevant from non-relevant documents. Motivated by these observations, we explore two training strategies that integrate both supervision signals: data scheduling and frequency-aware multi-objective learning. Both approaches improve ranking performance across queries at all frequency levels, with the latter being more effective. Our code is available at https://github.com/Trustworthy-Information-Access/LLMAnn_Click.
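To make the frequency-aware multi-objective idea concrete, below is a minimal sketch of how the two supervision signals could be combined with a weight that depends on query frequency. The listwise softmax losses, the linear bucket-based weighting, and all function and variable names (`frequency_weight`, `multi_objective_loss`, `freq_bucket`) are illustrative assumptions, not the paper's exact formulation; see the released code at the repository above for the actual implementation.

```python
# Sketch: frequency-aware combination of click supervision and LLM-annotation
# supervision. All design choices here are assumptions for illustration only.
import torch
import torch.nn.functional as F


def frequency_weight(freq_bucket: torch.Tensor, num_buckets: int = 3) -> torch.Tensor:
    """Map a query-frequency bucket (0 = tail, ..., num_buckets - 1 = head)
    to a weight on the click-supervised loss; the LLM loss gets 1 - w."""
    return freq_bucket.float() / (num_buckets - 1)


def multi_objective_loss(scores: torch.Tensor,
                         click_labels: torch.Tensor,
                         llm_labels: torch.Tensor,
                         freq_bucket: torch.Tensor) -> torch.Tensor:
    """Weighted sum of a click-supervised and an LLM-annotation-supervised objective.

    scores:       (batch, list_size) model scores for each query's document list
    click_labels: (batch, list_size) binary click signals
    llm_labels:   (batch, list_size) graded LLM relevance labels
    freq_bucket:  (batch,) query-frequency bucket per query
    """
    # Listwise softmax cross-entropy against each supervision signal (assumed choice).
    log_probs = F.log_softmax(scores, dim=-1)
    click_target = click_labels / click_labels.sum(-1, keepdim=True).clamp_min(1e-8)
    llm_target = F.softmax(llm_labels, dim=-1)
    loss_click = -(click_target * log_probs).sum(-1)
    loss_llm = -(llm_target * log_probs).sum(-1)

    # Head queries lean on clicks; tail queries lean on LLM annotations.
    w = frequency_weight(freq_bucket)
    return (w * loss_click + (1.0 - w) * loss_llm).mean()


# Usage example with random tensors.
scores = torch.randn(4, 10)
clicks = torch.randint(0, 2, (4, 10)).float()
llm = torch.randint(0, 5, (4, 10)).float()
buckets = torch.tensor([2, 1, 0, 2])  # head, mid, tail, head
print(multi_objective_loss(scores, clicks, llm, buckets))
```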