The key limitation of verification performance lies in error detection. Guided by this intuition, we designed several variants of pessimistic verification: simple workflows that substantially improve the verification of open-ended math questions. In pessimistic verification, we construct multiple parallel verifications of the same proof, and the proof is deemed incorrect if any one of them reports an error. This simple technique significantly improves performance across many math verification benchmarks without incurring significant additional computational cost; its token efficiency even surpasses extended long-CoT in test-time scaling. Our case studies further indicate that most false negatives produced by stronger models are actually caused by annotation errors in the original dataset, so our method's performance is in fact underestimated. Self-verification of mathematical problems effectively improves the reliability and performance of language-model outputs, and it also plays a critical role in enabling long-horizon mathematical tasks. We believe that research on pessimistic verification will help enhance the mathematical capabilities of language models across a wide range of tasks.
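The aggregation rule described above (reject a proof if any of the parallel verification runs flags an error) can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: `verify_once` stands in for a single LLM verification call, and the toy verifier below is a hypothetical stub used only to make the sketch runnable.

```python
def pessimistic_verify(proof: str, verify_once, k: int = 4) -> bool:
    """Run k independent verifications of the same proof.

    The proof is accepted only if every run accepts it; a single
    reported error is enough to reject (the 'pessimistic' rule).
    """
    verdicts = [verify_once(proof) for _ in range(k)]
    return all(verdicts)  # any single False verdict rejects the proof


# Hypothetical stand-in verifier: rejects proofs containing an obvious flaw.
def toy_verifier(proof: str) -> bool:
    return "divide by zero" not in proof


print(pessimistic_verify("a valid proof sketch", toy_verifier))      # True
print(pessimistic_verify("step 3: divide by zero", toy_verifier))    # False
```

In practice the k runs would be independent samples from a verifier model (ideally executed in parallel), so the extra cost is bounded by k verification calls rather than a longer chain of thought.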