Automatically generated code is gaining traction recently, owing to the prevalence of Large Language Models (LLMs). Further, the AlphaProof initiative has demonstrated the possibility of using AI for general mathematical reasoning. Reasoning about computer programs (software) can be accomplished via general mathematical reasoning; however, it tends to be more structured and richer in contexts. This forms an attractive proposition, since then AI agents can be used to reason about voluminous code that gets generated by AI. In this work, we present a first LLM agent, AutoRocq, for conducting program verification. Unlike past works, which rely on extensive training of LLMs on proof examples, our agent learns on-the-fly and improves the proof via an iterative refinement loop. The iterative improvement of the proof is achieved by the proof agent communicating with the Rocq (formerly Coq) theorem prover to get additional context and feedback. The final result of the iteration is a proof derivation checked by the Rocq theorem prover. In this way, our proof construction involves autonomous collaboration between the proof agent and the theorem prover. This autonomy facilitates the search for proofs and decision-making in deciding on the structure of the proof tree. Experimental evaluation on SV-COMP benchmarks and on Linux kernel modules shows promising efficacy in achieving automated program verification. As automation in code generation becomes more widespread, we posit that our proof agent can be potentially integrated with AI coding agents to achieve a generate and validate loop, thus moving closer to the vision of trusted automatic programming.
翻译:近年来,得益于大型语言模型(LLMs)的普及,自动生成代码正日益受到关注。此外,AlphaProof项目展示了利用人工智能进行通用数学推理的可能性。对计算机程序(软件)的推理可通过通用数学推理实现,但其往往更具结构性和丰富的上下文信息。这构成了一个极具吸引力的命题,因为人工智能代理可用于推理由AI生成的大量代码。在本研究中,我们提出了首个用于程序验证的LLM代理——AutoRocq。与以往依赖对证明示例进行大量LLM训练的工作不同,我们的代理通过迭代优化循环进行实时学习并改进证明。证明的迭代改进通过证明代理与Rocq(原Coq)定理证明器通信以获取额外上下文和反馈来实现。迭代的最终结果是由Rocq定理证明器验证的证明推导。通过这种方式,我们的证明构建涉及证明代理与定理证明器之间的自主协作。这种自主性促进了证明搜索和证明树结构决策的制定。在SV-COMP基准测试和Linux内核模块上的实验评估显示,其在实现自动化程序验证方面具有显著效能。随着代码生成自动化的日益普及,我们认为该证明代理有望与AI编码代理集成,实现生成与验证的闭环,从而更接近可信自动编程的愿景。