Can Large Language Models Simulate Symbolic Execution Output Like KLEE?

Symbolic execution checks programs by exploring execution paths over symbolic rather than concrete inputs. Tools like KLEE are widely used because they can automatically detect bugs and generate test cases. One of KLEE's biggest drawbacks, however, is how slow it becomes when a program has many branching paths: it is often too resource-intensive to run on large or complex code. In this project, we investigated whether a large language model such as GPT-4o could simulate the kinds of outputs that KLEE generates. The idea was to explore whether LLMs could one day replace parts of symbolic execution to save time and resources. One specific goal was to have GPT-4o identify the most constrained path in a program, that is, the execution path with the most symbolic conditions. These paths are especially important because they often represent edge cases that are harder to test and more likely to contain deep bugs. Finding such a path normally requires a full KLEE run, which can be expensive. We therefore tested whether GPT-4o could predict KLEE's outputs and the most constrained path on a dataset of 100 C programs. Our results showed about 20% accuracy in generating KLEE-like outputs and identifying the most constrained path. Although this accuracy is low, the early work helps show what current LLMs can and cannot do when simulating symbolic execution.
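
To make the notion of a "most constrained path" concrete, here is a minimal Python sketch. It is not KLEE's algorithm or the method used in this project. It assumes a hypothetical branch tree (the `TOY` structure and the helper names `enumerate_paths` and `most_constrained_path` are invented for illustration), enumerates every root-to-leaf path, and picks the one that accumulates the most conditions.

```python
from typing import List, Optional, Tuple

# Hypothetical branch tree: (condition, subtree if true, subtree if false).
# None marks the end of a path. This is an illustrative stand-in, not
# KLEE's internal representation.
Node = Optional[Tuple[str, "Node", "Node"]]


def enumerate_paths(node: Node, prefix: Tuple[str, ...] = ()) -> List[Tuple[str, ...]]:
    """Return the conditions collected along every root-to-leaf path."""
    if node is None:
        return [prefix]
    cond, if_true, if_false = node
    return (enumerate_paths(if_true, prefix + (cond,))
            + enumerate_paths(if_false, prefix + (f"!({cond})",)))


def most_constrained_path(tree: Node) -> Tuple[str, ...]:
    """Pick the path with the largest number of symbolic conditions."""
    return max(enumerate_paths(tree), key=len)


# Toy program shape (assumed for illustration):
#   if (x > 0) { if (y == 3) { if (x < 10) { ... } } }
TOY: Node = ("x > 0",
             ("y == 3",
              ("x < 10", None, None),
              None),
             None)

paths = enumerate_paths(TOY)
best = most_constrained_path(TOY)
print(len(paths), best)
```

Even in this toy, the number of paths grows with every nested branch, which hints at why exhaustive exploration becomes costly on larger programs and why predicting the deepest path without running the tool would be attractive.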
