We introduce IndiMathBench, a human-verified benchmark for evaluating mathematical theorem proving, curated with an AI-powered, human-assisted pipeline that formalizes natural-language problems in Lean. IndiMathBench comprises 312 formal Lean 4 theorems paired with their corresponding informal problem statements, sourced from Indian Mathematics Olympiads. Through category-based retrieval, iterative compiler feedback, and multi-model ensembles, our pipeline generates candidate formalizations that experts validate efficiently via an interactive dashboard with automated quality summaries. Evaluation across multiple frontier models shows that autoformalization remains difficult, with substantial gaps between syntactic validity and semantic correctness, and that theorem-proving success rates stay low even with iterative refinement, establishing IndiMathBench as a challenging testbed for mathematical reasoning. IndiMathBench is available at https://github.com/prmbiy/IndiMathBench.
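To make the informal-formal pairing concrete, the sketch below shows what a benchmark entry of this kind might look like: an olympiad-style informal statement alongside a Lean 4 formalization, with the proof elided for a prover to complete. The problem, theorem names, and proof here are illustrative assumptions, not drawn from the actual dataset; the completed proof assumes Mathlib's `Nat.even_mul_succ_self`.

```lean
import Mathlib

-- Hypothetical entry (not from IndiMathBench). Informal statement:
-- "Show that for every natural number n, n² + n is even."
-- Benchmark-style form: the theorem statement with the proof left as
-- `sorry` for the model under evaluation to fill in.
theorem inmo_style_example (n : ℕ) : Even (n ^ 2 + n) := by
  sorry

-- One possible completed proof, for illustration only.
theorem inmo_style_example_solved (n : ℕ) : Even (n ^ 2 + n) := by
  have h : n ^ 2 + n = n * (n + 1) := by ring  -- factor as consecutive product
  rw [h]
  exact Nat.even_mul_succ_self n               -- n(n+1) is always even
```

Under this format, syntactic validity corresponds to the statement compiling, while semantic correctness requires that the formal theorem faithfully captures the informal problem; the gap between the two is exactly what expert validation targets.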