Process reward models (PRMs) that provide dense, step-level feedback have shown promise for reinforcement learning, yet their adoption remains limited by the need for expensive step-level annotations or ground-truth references. We propose SPARK, a three-stage framework. In the first stage, a generator model produces diverse solutions and a verifier model evaluates them using parallel scaling (self-consistency) and sequential scaling (meta-critique). In the second stage, we use these verification outputs as synthetic training data to fine-tune generative process reward models, which later serve as reward signals during reinforcement learning. We show that aggregating multiple independent verifications at the step level yields PRM training data that outperforms ground-truth outcome supervision: the resulting model achieves 67.5 F1 on ProcessBench (a benchmark for identifying erroneous steps in mathematical reasoning), compared to 66.4 for reference-guided training and 61.9 for GPT-4o. In the final stage, we apply our generative PRM with chain-of-thought verification (PRM-CoT) as the reward model in RL experiments on mathematical reasoning, and introduce format constraints to prevent reward hacking. Using Qwen2.5-Math-7B, we achieve 47.4% average accuracy across six mathematical reasoning benchmarks, outperforming ground-truth-based RLVR (43.9%). Our work enables reference-free RL training that exceeds ground-truth methods, opening new possibilities for domains lacking verifiable answers or accessible ground truth.
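For intuition, the sketch below illustrates the step-level aggregation idea from the second stage: several independent verifier runs judge each step of a solution, and a majority vote over those judgments produces a synthetic per-step label for PRM training. The function name, boolean-vote scheme, and data layout are illustrative assumptions, not the paper's exact procedure.

```python
# Illustrative sketch (hypothetical names, not the authors' released code):
# aggregate multiple independent step-level verifications into synthetic
# PRM training labels via majority vote (the self-consistency idea above).
from collections import Counter
from typing import List

def aggregate_step_labels(verdicts: List[List[bool]]) -> List[bool]:
    """verdicts[k][i] is verifier run k's judgment of step i (True = step correct).
    Returns one majority-vote label per step."""
    n_runs = len(verdicts)
    n_steps = len(verdicts[0])
    labels = []
    for i in range(n_steps):
        votes = Counter(run[i] for run in verdicts)
        labels.append(votes[True] > n_runs / 2)
    return labels

# Example: three independent verifications of a four-step solution.
runs = [
    [True, True, False, False],
    [True, True, True,  False],
    [True, False, False, False],
]
print(aggregate_step_labels(runs))  # [True, True, False, False]
```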