Recent advances in large language models (LLMs) have demonstrated impressive capabilities in formal theorem proving, particularly on contest-based mathematical benchmarks like the IMO. However, these contests do not reflect the depth, breadth, and abstraction of modern mathematical research. To bridge this gap, we introduce FATE (Formal Algebra Theorem Evaluation), a new benchmark series in formal algebra designed to chart a course toward advanced mathematical reasoning. We present two new components, FATE-H and FATE-X, each with 100 problems in abstract and commutative algebra. The FATE series spans a difficulty spectrum from undergraduate exercises to problems exceeding PhD qualifying exams. Notably, FATE-X is the first formal benchmark to surpass both PhD-level exam difficulty and the coverage of the Mathlib library. Our evaluations of state-of-the-art LLM provers on this new benchmark reveal a stark performance gap compared to contest math: the best model achieves only 3% (pass@64) accuracy on FATE-H and 0% on FATE-X. Our two-stage evaluation reveals that models' natural-language reasoning is notably more accurate than their ability to formalize this reasoning. We systematically classify the common errors that arise during this formalization process. Furthermore, a comparative study shows that a specialized prover can exhibit less effective reflection than general-purpose models, reducing its accuracy at the natural-language stage. We believe FATE provides a robust and challenging benchmark that establishes essential checkpoints on the path toward research-level formal mathematical reasoning.
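To make the formalization task concrete, the following is a hypothetical illustration (not drawn from FATE) of the kind of Mathlib-style commutative-algebra statement such a benchmark targets; the lemma name `nilradical_le_prime` is the Mathlib lemma this example defers to.

```lean
import Mathlib

-- Hypothetical example problem: in a commutative ring, the nilradical
-- is contained in every prime ideal. Stated and closed in Mathlib style.
example {R : Type*} [CommRing R] (P : Ideal R) [P.IsPrime] :
    nilradical R ≤ P :=
  nilradical_le_prime P
```

Even an elementary statement like this requires the prover to navigate Mathlib's typeclass hierarchy (`CommRing`, `Ideal.IsPrime`); the paper's harder problems exceed what Mathlib currently covers, so no such library lemma is available to close them.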