Spark-Prover-X1：通过多样化数据训练实现形式化定理证明 (Spark-Prover-X1: Formal Theorem Proving Through Diverse Data Training)

Large Language Models (LLMs) have shown significant promise in automated theorem proving, yet progress is often constrained by the scarcity of diverse and high-quality formal language data. To address this issue, we introduce Spark-Prover-X1, a 7B parameter model trained via an three-stage framework designed to unlock the reasoning potential of more accessible and moderately-sized LLMs. The first stage infuses deep knowledge through continuous pre-training on a broad mathematical corpus, enhanced by a suite of novel data tasks. Key innovation is a "CoT-augmented state prediction" task to achieve fine-grained reasoning. The second stage employs Supervised Fine-tuning (SFT) within an expert iteration loop to specialize both the Spark-Prover-X1-7B and Spark-Formalizer-X1-7B models. Finally, a targeted round of Group Relative Policy Optimization (GRPO) is applied to sharpen the prover's capabilities on the most challenging problems. To facilitate robust evaluation, particularly on problems from real-world examinations, we also introduce ExamFormal-Bench, a new benchmark dataset of 402 formal problems. Experimental results demonstrate that Spark-Prover achieves state-of-the-art performance among similarly-sized open-source models within the "Whole-Proof Generation" paradigm. It shows exceptional performance on difficult competition benchmarks, notably solving 27 problems on PutnamBench (pass@32) and achieving 24.0\% on CombiBench (pass@32). Our work validates that this diverse training data and progressively refined training pipeline provides an effective path for enhancing the formal reasoning capabilities of lightweight LLMs. We will release both Spark-Prover-X1-7B and Spark-Formalizer-X1-7B, along with the ExamFormal-Bench dataset, in the near future.

翻译：大型语言模型（LLMs）在自动定理证明领域展现出巨大潜力，但其进展常受限于高质量、多样化的形式语言数据稀缺问题。为解决此问题，我们提出了Spark-Prover-X1——一个基于三阶段训练框架构建的70亿参数模型，旨在释放中等规模LLMs的推理潜能。第一阶段通过在大规模数学语料上进行持续预训练，并辅以一系列创新数据任务来注入深度知识，其中关键创新是引入“思维链增强状态预测”任务以实现细粒度推理。第二阶段在专家迭代循环中采用监督微调（SFT），专门优化Spark-Prover-X1-7B与Spark-Formalizer-X1-7B模型。最后，通过针对性的分组相对策略优化（GRPO）轮次，提升证明器在最复杂问题上的能力。为支持稳健评估（特别是针对现实考试问题），我们同时提出了ExamFormal-Bench——一个包含402道形式化问题的新基准数据集。实验结果表明，在“全证明生成”范式下，Spark-Prover在同等规模开源模型中达到了最先进的性能水平。其在困难竞赛基准测试中表现卓越，特别是在PutnamBench上解决了27个问题（pass@32），在CombiBench上达到24.0%（pass@32）。本研究验证了多样化训练数据与渐进式精炼训练流程能为轻量化LLMs的形式推理能力提升提供有效路径。我们即将公开发布Spark-Prover-X1-7B、Spark-Formalizer-X1-7B模型及ExamFormal-Bench数据集。

相关内容

Spark

关注 0

Apache Spark 是专为大规模数据处理而设计的快速通用的计算引擎。Spark是UC Berkeley AMP lab (加州大学伯克利分校的AMP实验室)所开源的类Hadoop MapReduce的通用并行框架，Spark，拥有Hadoop MapReduce所具有的优点；但不同于MapReduce的是Job中间输出结果可以保存在内存中，从而不再需要读写HDFS，因此Spark能更好地适用于数据挖掘与机器学习等需要迭代的MapReduce的算法。

DeepSeek模型综述：V1 V2 V3 R1-Zero

专知会员服务

116+阅读 · 2月11日

【NeurIPS2020】可处理的反事实推理的深度结构因果模型

专知会员服务

49+阅读 · 2020年9月28日

【SIGIR2020-中科院计算所】L2R2: 利用排名进行外展推理，L2R2: Leveraging Ranking for Abductive Reasoning

专知会员服务

11+阅读 · 2020年5月25日

Query2box: 使用盒嵌入对向量空间中的知识图谱进行推理，Query2box: Reasoning over Knowledge Graphs in Vector Space Using Box Embeddings

专知会员服务

46+阅读 · 2020年5月11日