Large reasoning models (e.g., R1, o3) have demonstrated remarkable mathematical problem-solving abilities. However, the high accuracy these advanced models report on popular datasets, coupled with reliance on purely numerical evaluation and potential benchmark leakage, often masks their true reasoning shortcomings. To address this, we propose leveraging the inherent rigor and methodological complexity of mathematical proofs as a diagnostic tool to expose these hidden failures. Specifically, we introduce the RFMDataset (Reveal Failure Modes), a collection of 200 diverse mathematical proof problems, and thoroughly evaluate advanced models' performance on it. Our in-depth analysis of their failures uncovers 10 fine-grained error types, which reveal fundamental limitations in current large reasoning models: 1) large reasoning models struggle profoundly with mathematical proofs, with some producing entirely correct proofs for fewer than 20% of problems and failing even on basic ones; 2) models exhibit a diverse spectrum of reasoning failures, most notably a lack of guarantees for the correctness and rigor of single-step reasoning; and 3) models show hallucination and incompleteness during the reasoning process. Our findings reveal that models' self-reflection is insufficient to resolve the current logical dilemmas, necessitating formalized and fine-grained logical training.