Finetuning specialized generative evaluators has emerged as a popular paradigm to meet the increasing demand for scalable evaluation during both training and test-time. However, recent work has largely focused on applying new methodologies, such as reinforcement learning (RL), to evaluator training, shying away from large-scale, data-driven development. In this work, we focus on data scaling, curating a set of 2.5M samples spanning five unique evaluation tasks (pairwise comparison, step-level evaluation, reference-free and reference-based verification, and single rating) and multiple domains focused on reasoning evaluation. With our data, we train Foundational Automatic Reasoning Evaluators (FARE), a family of 8B and 20B (with 3.6B active) parameter evaluators, using a simple iterative rejection-sampling supervised finetuning (SFT) approach. FARE-8B challenges larger specialized RL-trained evaluators, and FARE-20B sets a new standard for open-source evaluators, surpassing specialized 70B+ evaluators. Beyond static benchmarks, we evaluate FARE on real-world tasks: as an inference-time reranker, FARE-20B achieves near-oracle performance on MATH; as a verifier in RL training, FARE improves downstream RL-trained model performance by up to 14.1% over string-matching verifiers; and a FARE-Code model, continually finetuned from FARE, outperforms gpt-oss-20B by 65% on evaluating test-case quality.
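To make the training recipe concrete, the sketch below illustrates one plausible form of an iterative rejection-sampling SFT loop for an evaluator. The helper names (`generate_judgment`, `finetune`), the pairwise-verdict format, and the label-matching acceptance rule are illustrative assumptions, not the paper's exact implementation.

```python
# Minimal sketch of iterative rejection-sampling SFT for an evaluator.
# Assumptions: each task carries a gold evaluation label; sampled judgments are
# accepted only if their final verdict matches that label; accepted traces are
# then used as SFT targets, and the loop repeats with the updated model.
import random

def generate_judgment(model, prompt):
    """Placeholder: sample one reasoning trace ending in a verdict."""
    verdict = random.choice(["A", "B"])  # hypothetical pairwise verdict
    return {"text": f"... therefore the better response is {verdict}", "verdict": verdict}

def rejection_sample_sft(model, tasks, n_samples=8, n_rounds=2, finetune=None):
    """tasks: list of {"prompt": str, "gold": str}. Returns the updated model."""
    for _ in range(n_rounds):
        accepted = []
        for task in tasks:
            for _ in range(n_samples):
                out = generate_judgment(model, task["prompt"])
                if out["verdict"] == task["gold"]:  # keep only label-consistent traces
                    accepted.append({"prompt": task["prompt"], "target": out["text"]})
        if finetune is not None:
            model = finetune(model, accepted)  # SFT on the accepted judgments
    return model
```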