Large Language Models (LLMs) have shown significant promise in automated theorem proving, yet progress is often constrained by the scarcity of diverse, high-quality formal language data. To address this issue, we introduce Spark-Prover-X1, a 7B-parameter model trained via a three-stage framework designed to unlock the reasoning potential of more accessible, moderately-sized LLMs. The first stage infuses deep knowledge through continuous pre-training on a broad mathematical corpus, enhanced by a suite of novel data tasks; a key innovation is a "CoT-augmented state prediction" task that enables fine-grained reasoning. The second stage employs Supervised Fine-Tuning (SFT) within an expert iteration loop to specialize both the Spark-Prover-X1-7B and Spark-Formalizer-X1-7B models. Finally, a targeted round of Group Relative Policy Optimization (GRPO) sharpens the prover's capabilities on the most challenging problems. To facilitate robust evaluation, particularly on problems drawn from real-world examinations, we also introduce ExamFormal-Bench, a new benchmark dataset of 402 formal problems. Experimental results demonstrate that Spark-Prover-X1-7B achieves state-of-the-art performance among similarly sized open-source models under the "Whole-Proof Generation" paradigm. It performs especially well on difficult competition benchmarks, solving 27 problems on PutnamBench (pass@32) and achieving 24.0\% on CombiBench (pass@32). Our work validates that diverse training data, combined with a progressively refined training pipeline, provides an effective path for enhancing the formal reasoning capabilities of lightweight LLMs. Spark-Prover-X1-7B and Spark-Formalizer-X1-7B, along with the ExamFormal-Bench dataset, are publicly available at https://www.modelscope.cn/organization/iflytek and https://gitcode.com/ifly_opensource.