Large Language Models (LLMs) have shown significant promise in automated theorem proving, yet progress is often constrained by the scarcity of diverse, high-quality formal language data. To address this issue, we introduce Spark-Prover-X1, a 7B-parameter model trained via a three-stage framework designed to unlock the reasoning potential of more accessible, moderately-sized LLMs. The first stage infuses deep knowledge through continuous pre-training on a broad mathematical corpus, enhanced by a suite of novel data tasks; a key innovation is a "CoT-augmented state prediction" task that enables fine-grained reasoning. The second stage employs Supervised Fine-Tuning (SFT) within an expert iteration loop to specialize both the Spark-Prover-X1-7B and Spark-Formalizer-X1-7B models. Finally, a targeted round of Group Relative Policy Optimization (GRPO) sharpens the prover's capabilities on the most challenging problems. To facilitate robust evaluation, particularly on problems drawn from real-world examinations, we also introduce ExamFormal-Bench, a new benchmark dataset of 402 formal problems. Experimental results demonstrate that Spark-Prover-X1-7B achieves state-of-the-art performance among similarly sized open-source models under the "Whole-Proof Generation" paradigm. It performs especially well on difficult competition benchmarks, solving 27 problems on PutnamBench (pass@32) and achieving 24.0\% on CombiBench (pass@32). Our work validates that diverse training data, combined with a progressively refined training pipeline, provides an effective path for enhancing the formal reasoning capabilities of lightweight LLMs. Spark-Prover-X1-7B and Spark-Formalizer-X1-7B, along with the ExamFormal-Bench dataset, are publicly available at https://www.modelscope.cn/organization/iflytek and https://gitcode.com/ifly_opensource.