Training effective text rerankers is crucial for information retrieval. Two strategies are widely used: contrastive learning (optimizing directly on ground-truth labels) and knowledge distillation (transferring knowledge from a larger reranker). While both have been studied extensively, a clear comparison of their effectiveness for training cross-encoder rerankers under practical conditions is still needed. This paper empirically compares the two strategies by training rerankers of different sizes (0.5B, 1.5B, 3B, 7B) and architectures (Transformer, Recurrent) with both methods on the same data, using a strong contrastive learning model as the distillation teacher. Our results show that knowledge distillation generally yields better in-domain and out-of-domain ranking performance than contrastive learning when distilling from a more performant teacher model, and this finding is consistent across student model sizes and architectures. However, distilling from a teacher of the same capacity does not provide the same advantage, particularly on out-of-domain tasks. These findings offer practical guidance for choosing a training strategy based on the teacher models available: we recommend knowledge distillation for training smaller rerankers when a larger, more performant teacher is accessible; in its absence, contrastive learning remains a robust baseline. Our code is made available to facilitate reproducibility.
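To make the two training objectives concrete, the sketch below (PyTorch; function and tensor names are ours, not from the paper) contrasts a standard InfoNCE-style contrastive loss on labeled positives and hard negatives with one common formulation of reranker distillation, a temperature-scaled KL divergence between student and teacher score distributions over the same candidate list. It is a minimal illustration under these assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F


def contrastive_loss(student_scores: torch.Tensor) -> torch.Tensor:
    """InfoNCE-style loss over one positive and several hard negatives.

    student_scores: (batch, 1 + num_negatives) cross-encoder relevance scores,
    with the ground-truth positive passage placed in column 0.
    """
    labels = torch.zeros(student_scores.size(0), dtype=torch.long,
                         device=student_scores.device)
    return F.cross_entropy(student_scores, labels)


def distillation_loss(student_scores: torch.Tensor,
                      teacher_scores: torch.Tensor,
                      temperature: float = 1.0) -> torch.Tensor:
    """KL divergence between student and teacher score distributions
    over the same candidate list (one common distillation objective)."""
    student_log_probs = F.log_softmax(student_scores / temperature, dim=-1)
    teacher_probs = F.softmax(teacher_scores / temperature, dim=-1)
    return F.kl_div(student_log_probs, teacher_probs,
                    reduction="batchmean") * temperature ** 2
```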