Relevance evaluation plays a crucial role in personalized search systems, ensuring that search results align with users' queries and intent. While human annotation is the traditional method for relevance evaluation, its high cost and long turnaround time limit its scalability. In this work, we present our approach at Pinterest Search to automating relevance evaluation for online experiments using fine-tuned LLMs. We rigorously validate the alignment between LLM-generated judgments and human annotations, demonstrating that LLMs can provide reliable relevance measurement for experiments while greatly improving evaluation efficiency. Leveraging LLM-based labeling further unlocks opportunities to expand the query set, optimize sampling design, and efficiently assess a wider range of search experiences at scale. This approach leads to higher-quality relevance metrics and significantly reduces the Minimum Detectable Effect (MDE) in online experiment measurements.