Query-product relevance prediction is fundamental to e-commerce search and has become even more critical in the era of AI-powered shopping, where semantic understanding and complex reasoning directly shape the user experience and business conversion. Large Language Models (LLMs) enable generative, reasoning-based approaches, typically aligned via supervised fine-tuning (SFT) or preference optimization methods such as Direct Preference Optimization (DPO). However, the increasing complexity of business rules and user queries exposes the inability of existing methods to endow models with robust reasoning capacity for long-tail and challenging cases. Efforts to address this via reinforcement learning strategies like Group Relative Policy Optimization (GRPO) often suffer from sparse terminal rewards, which offer insufficient guidance for multi-step reasoning and slow convergence. To address these challenges, we propose TaoSR-AGRL, an Adaptive Guided Reinforcement Learning framework for LLM-based relevance prediction in Taobao Search. TaoSR-AGRL introduces two key innovations: (1) Rule-aware Reward Shaping, which decomposes the final relevance judgment into dense, structured rewards aligned with domain-specific relevance criteria; and (2) Adaptive Guided Replay, which identifies low-accuracy rollouts during training and injects targeted ground-truth guidance to steer the policy away from stagnant, rule-violating reasoning patterns toward compliant trajectories. TaoSR-AGRL was evaluated on large-scale real-world datasets and through online side-by-side human evaluations on Taobao Search. It consistently outperforms DPO and standard GRPO baselines in offline experiments, improving relevance accuracy, rule adherence, and training stability. The model trained with TaoSR-AGRL has been successfully deployed in the main search scenario on Taobao, serving hundreds of millions of users.
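To make the two mechanisms concrete, the following is a minimal sketch of how rule-aware reward shaping and adaptive guided replay could be wired together. The rule names, weights, the `accuracy_floor` threshold, and the `[GUIDANCE]` prompt format are illustrative assumptions, not the actual Taobao Search criteria or the paper's exact algorithm.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical rule checkers standing in for domain-specific relevance criteria
# (names and weights are assumptions for illustration only).
RULES: List[Tuple[str, Callable[[str, str], bool], float]] = [
    ("category_consistent", lambda query, reasoning: "category" in reasoning, 0.3),
    ("attributes_covered",  lambda query, reasoning: "attribute" in reasoning, 0.3),
    ("intent_addressed",    lambda query, reasoning: "intent" in reasoning, 0.4),
]

def rule_aware_reward(query: str, reasoning: str, pred_label: int, gold_label: int) -> float:
    """Dense, structured reward: partial credit for each satisfied rule,
    plus a terminal bonus when the final relevance label is correct."""
    dense = sum(w for _, check, w in RULES if check(query, reasoning))
    terminal = 1.0 if pred_label == gold_label else 0.0
    return dense + terminal

@dataclass
class Rollout:
    query: str
    reasoning: str
    pred_label: int
    gold_label: int

def adaptive_guided_replay(rollouts: List[Rollout], accuracy_floor: float = 0.25) -> List[str]:
    """If a group of rollouts for the same query falls below an accuracy floor,
    build guided replay prompts that inject ground-truth hints and the violated rules."""
    acc = sum(r.pred_label == r.gold_label for r in rollouts) / max(len(rollouts), 1)
    if acc >= accuracy_floor:
        return []  # group is already producing useful learning signal; no guidance injected
    guided = []
    for r in rollouts:
        violated = [name for name, check, _ in RULES if not check(r.query, r.reasoning)]
        hint = f"Ground-truth label: {r.gold_label}. Re-check rules: {', '.join(violated) or 'none'}."
        guided.append(f"{r.query}\n[GUIDANCE] {hint}")
    return guided
```

Under these assumptions, the dense per-rule terms give the policy gradient signal even when the terminal label is wrong, and the replay path only intervenes on query groups whose rollouts stagnate below the accuracy floor.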