We present Athena-PRM, a multimodal process reward model (PRM) designed to assign a reward score to each step in solving complex reasoning problems. Developing high-performance PRMs typically demands significant time and financial investment, primarily due to the need for step-level annotation of reasoning steps. Conventional automated labeling methods, such as Monte Carlo estimation, often produce noisy labels and incur substantial computational costs. To efficiently generate high-quality process-labeled data, we propose leveraging prediction consistency between weak and strong completers as a criterion for identifying reliable process labels. Remarkably, Athena-PRM demonstrates outstanding effectiveness across various scenarios and benchmarks with just 5,000 samples. We further develop two effective strategies to improve PRM performance: ORM initialization and up-sampling of negative data. We validate our approach in three scenarios: verification for test-time scaling, direct evaluation of reasoning-step correctness, and reward-ranked fine-tuning. Athena-PRM consistently achieves superior performance across multiple benchmarks and scenarios. Notably, when using Qwen2.5-VL-7B as the policy model, Athena-PRM improves test-time scaling performance by 10.2 points on WeMath and 7.1 points on MathVista. Furthermore, Athena-PRM sets state-of-the-art (SoTA) results on VisualProcessBench, outperforming the previous SoTA by 3.9 F1 points and demonstrating its robust capability to accurately assess the correctness of reasoning steps. Additionally, using Athena-PRM as the reward model, we develop Athena-7B via reward-ranked fine-tuning, which outperforms the baseline by a significant margin on five benchmarks.
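The abstract only names the labeling criterion; the sketch below is a rough, hedged illustration (not the paper's exact procedure) of how prediction consistency between a weak and a strong completer could be turned into step-level labels. The Monte Carlo correctness estimate, the agreement rule, and the stub completers are all assumptions introduced for illustration.

```python
from typing import Callable, List, Optional

def mc_correct_rate(completer: Callable[[str], List[str]],
                    prefix: str, reference_answer: str) -> float:
    # Monte Carlo estimate: fraction of sampled completions from this
    # step prefix that reach the reference answer (assumed scoring rule).
    completions = completer(prefix)
    if not completions:
        return 0.0
    return sum(c.strip() == reference_answer for c in completions) / len(completions)

def consistency_label(weak: Callable[[str], List[str]],
                      strong: Callable[[str], List[str]],
                      prefix: str, reference_answer: str) -> Optional[int]:
    # Keep a step label (1 = correct, 0 = incorrect) only when the weak and
    # strong completers agree; on disagreement the label is treated as
    # unreliable and the sample is dropped (None).
    weak_ok = mc_correct_rate(weak, prefix, reference_answer) > 0.0
    strong_ok = mc_correct_rate(strong, prefix, reference_answer) > 0.0
    if weak_ok == strong_ok:
        return int(weak_ok)
    return None

# Toy stand-in completers; in practice these would sample rollouts from
# a weaker and a stronger policy model.
weak_completer = lambda prefix: ["42", "41", "42"]
strong_completer = lambda prefix: ["42", "42", "42"]

label = consistency_label(weak_completer, strong_completer,
                          prefix="Step 1: 6 * 7 = 42. Step 2:",
                          reference_answer="42")
print(label)  # 1 -> both completers reach the answer from this prefix
```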