To improve the Multi-step Mathematical Reasoning (MsMR) of Large Language Models (LLMs), it is crucial to obtain scalable supervision from the corpus by automatically critiquing mistakes in the reasoning process and rendering a final verdict on each problem-solution pair. Most existing methods rely on crafting high-quality supervised fine-tuning demonstrations to enhance critiquing capability, and pay little attention to the underlying reason for LLMs' poor critiquing performance. In this paper, we take an orthogonal view: we quantify and investigate a potential cause -- imbalanced evaluation preference -- through a statistical preference analysis, and, motivated by this analysis, propose a novel perplexity-aware reinforcement learning algorithm that rectifies the preference and thereby elevates critiquing capability. Specifically, to probe LLMs' critiquing characteristics, we meticulously construct a One-to-many Problem-Solution (OPS) benchmark that quantifies how differently an LLM behaves when evaluating solutions generated by itself versus by other models. To investigate this behavioral difference in depth, we conduct a perplexity-oriented statistical preference analysis and uncover an intriguing phenomenon -- ``LLMs tend to judge solutions with lower perplexity as correct'' -- which we dub the \textit{imbalanced evaluation preference}. To rectify this preference, we use perplexity as the baton in Group Relative Policy Optimization, encouraging LLMs to explore trajectories that judge lower-perplexity solutions as wrong and higher-perplexity solutions as correct. Extensive experiments on our OPS benchmark and existing critic benchmarks demonstrate the effectiveness of our method.
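To make the perplexity-guided reward shaping concrete, the following Python sketch shows one plausible way to fold a solution's perplexity into a GRPO-style group-relative advantage. This is a minimal illustration, not the paper's actual formulation: the helper names (\texttt{solution\_perplexity}, \texttt{shape\_reward}, \texttt{grpo\_advantages}), the median-perplexity threshold, and the \texttt{alpha} boost factor are all assumptions introduced here for clarity.

\begin{verbatim}
# Hypothetical sketch: perplexity-aware reward shaping inside a GRPO-style update.
# All names and the shaping rule are illustrative, not taken from the paper.
import numpy as np

def solution_perplexity(logprobs: np.ndarray) -> float:
    """Perplexity of the evaluated solution: exp of mean negative token log-prob."""
    return float(np.exp(-logprobs.mean()))

def shape_reward(base_reward: float, verdict_correct: bool, ppl: float,
                 ppl_median: float, alpha: float = 0.5) -> float:
    """Boost rewards for verdicts that run against the perplexity bias:
    judging a low-perplexity solution as wrong, or a high-perplexity one as correct."""
    against_bias = (verdict_correct and ppl > ppl_median) or \
                   (not verdict_correct and ppl <= ppl_median)
    return base_reward * (1.0 + alpha) if against_bias else base_reward

def grpo_advantages(rewards: np.ndarray) -> np.ndarray:
    """Group-relative advantages: normalize rewards within one sampled group."""
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

# Usage: a group of 4 sampled critique trajectories for one problem-solution pair.
logprobs = np.array([-0.4, -0.7, -0.6, -0.5])   # token log-probs of the solution
ppl = solution_perplexity(logprobs)              # ~1.73 (low-perplexity solution)
ppl_median = 2.5                                  # assumed batch-level statistic
base = np.array([1.0, 0.0, 1.0, 0.0])            # 1 = verdict matches ground truth
verdicts = [False, True, False, True]             # model's correct/incorrect judgments
shaped = np.array([shape_reward(r, v, ppl, ppl_median)
                   for r, v in zip(base, verdicts)])
adv = grpo_advantages(shaped)  # fed into the clipped policy-gradient objective
\end{verbatim}

In this sketch, trajectories that correctly judge a low-perplexity solution as wrong receive a larger shaped reward, so the group-normalized advantage pushes the policy away from the imbalanced evaluation preference.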