Large Language Models (LLMs) have shown significant promise in automated theorem proving, yet progress is often constrained by the scarcity of diverse, high-quality formal-language data. To address this issue, we introduce Spark-Prover-X1, a 7B-parameter model trained via a three-stage framework designed to unlock the reasoning potential of more accessible, moderately sized LLMs. The first stage infuses deep knowledge through continuous pre-training on a broad mathematical corpus, enhanced by a suite of novel data tasks; a key innovation is a "CoT-augmented state prediction" task that enables fine-grained reasoning. The second stage employs Supervised Fine-Tuning (SFT) within an expert-iteration loop to specialize both the Spark-Prover-X1-7B and Spark-Formalizer-X1-7B models. Finally, a targeted round of Group Relative Policy Optimization (GRPO) sharpens the prover's capabilities on the most challenging problems. To facilitate robust evaluation, particularly on problems drawn from real-world examinations, we also introduce ExamFormal-Bench, a new benchmark dataset of 402 formal problems. Experimental results demonstrate that Spark-Prover-X1-7B achieves state-of-the-art performance among similarly sized open-source models within the "Whole-Proof Generation" paradigm, with especially strong results on difficult competition benchmarks: it solves 27 problems on PutnamBench (pass@32) and achieves 24.0\% on CombiBench (pass@32). Our work validates that diverse training data combined with a progressively refined training pipeline provides an effective path for enhancing the formal reasoning capabilities of lightweight LLMs. We will release Spark-Prover-X1-7B, Spark-Formalizer-X1-7B, and the ExamFormal-Bench dataset in the near future.