We propose FlowRL: matching the full reward distribution via flow balancing instead of maximizing rewards in large language model (LLM) reinforcement learning (RL). Recent advanced reasoning models adopt reward-maximizing methods (\eg, PPO and GRPO), which tend to over-optimize dominant reward signals while neglecting less frequent but valid reasoning paths, thus reducing diversity. In contrast, we transform scalar rewards into a normalized target distribution using a learnable partition function, and then minimize the reverse KL divergence between the policy and the target distribution. We implement this idea as a flow-balanced optimization method that promotes diverse exploration and generalizable reasoning trajectories. We conduct experiments on math and code reasoning tasks: FlowRL achieves a significant average improvement of $10.0\%$ over GRPO and $5.1\%$ over PPO on math benchmarks, and performs consistently better on code reasoning tasks. These results highlight reward distribution-matching as a key step toward efficient exploration and diverse reasoning in LLM reinforcement learning.
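For concreteness, a minimal sketch of the objective described above, written in our own notation rather than anything fixed by the text: we assume the target is an exponential tilt of the scalar reward $r(x, y)$ with temperature $\beta$, normalized by a learnable per-prompt partition function $Z_{\phi}(x)$, and the policy $\pi_{\theta}$ is trained to match it under reverse KL:
\[
p_{\phi}(y \mid x) \;=\; \frac{\exp\!\bigl(\beta\, r(x, y)\bigr)}{Z_{\phi}(x)},
\qquad
\mathcal{L}(\theta, \phi)
\;=\; \mathrm{D}_{\mathrm{KL}}\!\bigl(\pi_{\theta}(\cdot \mid x)\,\big\|\,p_{\phi}(\cdot \mid x)\bigr)
\;=\; \mathbb{E}_{y \sim \pi_{\theta}(\cdot \mid x)}\!\Bigl[\log \pi_{\theta}(y \mid x) - \beta\, r(x, y) + \log Z_{\phi}(x)\Bigr].
\]
In flow-balance terms, one common realization (our reconstruction, not stated above) minimizes the squared trajectory-balance residual $\bigl(\log Z_{\phi}(x) + \log \pi_{\theta}(y \mid x) - \beta\, r(x, y)\bigr)^{2}$ over sampled trajectories, which is zero exactly when the policy reproduces the target distribution.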