Policy gradient algorithms have been successfully applied to enhance the reasoning capabilities of large language models (LLMs). KL regularization is ubiquitous, yet its design surface (the choice of KL direction, forward vs. reverse; normalization, normalized vs. unnormalized; and estimator, $k_1/k_2/k_3$) is scattered across the literature and often intertwined with off-policy estimation. We ask a focused question: in the off-policy setting, what weighting is required for each KL variant so that the surrogate we optimize yields the exact gradient of the intended KL-regularized objective? We answer this with a compact, unified derivation we call the Regularized Policy Gradient (RPG) view. RPG (i) unifies normalized and unnormalized KL variants and shows that the widely used $k_3$ penalty is exactly the unnormalized KL; (ii) specifies conditions under which REINFORCE-style losses with stop-gradient are gradient-equivalent to fully differentiable surrogates; (iii) identifies and corrects an off-policy importance-weighting mismatch in GRPO's KL term; and (iv) introduces RPG-Style Clip, a clipped-importance-sampling step within RPG-REINFORCE that enables stable, off-policy policy-gradient training at scale. On mathematical reasoning benchmarks (AIME24, AIME25), RPG-REINFORCE with RPG-Style Clip improves accuracy by up to $+6$ absolute percentage points over DAPO. When we extend training to an 8K context length, RPG-REINFORCE with RPG-Style Clip achieves 52% accuracy on AIME25, surpassing the official Qwen3-4B-Instruct model (47%). Notably, RPG is a stable and scalable RL algorithm for LLM reasoning, realized via (a) a KL-correct objective, (b) clipped importance sampling, and (c) an iterative reference-policy update scheme.
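For concreteness, the $k_1/k_2/k_3$ shorthand above follows the conventional per-sample estimators of the reverse KL $\mathrm{KL}(\pi_\theta \,\|\, \pi_{\mathrm{ref}})$; the sketch below is standard background rather than part of our derivation, and the symbols $x$, $y$, and $r$ are introduced only for this illustration, with samples $y \sim \pi_\theta(\cdot \mid x)$ and ratio $r = \pi_{\mathrm{ref}}(y \mid x)/\pi_\theta(y \mid x)$:
\[
  k_1 = -\log r, \qquad
  k_2 = \tfrac{1}{2}\bigl(\log r\bigr)^2, \qquad
  k_3 = r - 1 - \log r .
\]
Here $\mathbb{E}_{y\sim\pi_\theta}[k_1]$ and $\mathbb{E}_{y\sim\pi_\theta}[k_3]$ both recover $\mathrm{KL}(\pi_\theta \,\|\, \pi_{\mathrm{ref}})$ exactly, $k_2$ is a biased low-variance approximation, and $k_3$ is additionally nonnegative for every sample; $k_3$ is the penalty form whose interpretation as an unnormalized KL is the subject of item (i).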