Fault localization (FL) is a critical step in debugging that typically relies on repeated executions to pinpoint faulty code regions. Repeated executions, however, can be impractical when failures are non-deterministic or execution is costly. Recent work has leveraged Large Language Models (LLMs) for execution-free FL, but it has focused primarily on faults in the system under test (SUT) rather than in the often complex system-level test code. The latter matters as well, because in practice many failures are triggered by faulty test code. To address these challenges, we introduce a fully static, LLM-driven approach to system-level test code fault localization (TCFL) that does not require executing the test case. Our method uses a single failure execution log to estimate the test's execution trace through three novel algorithms that retain only the code statements likely to be involved in the failure. This pruned trace, combined with the error message, is used to prompt the LLM to rank potentially faulty locations. Our black-box, system-level approach requires no access to the SUT source code and applies to complex test scripts that assess full system behavior. We evaluate our technique at the function, block, and line levels on an industrial dataset of faulty test cases that were not used in LLM pre-training. Results show that our best-estimated traces closely match the actual traces, with an F1 score of around 90%. Additionally, pruning the complex system-level test code reduces the LLM's inference time by up to 34% without any loss in FL performance. Compared with the latest LLM-guided FL method, our method achieves equal or higher FL accuracy while requiring over 85% less average inference time per test case and 93% fewer tokens.
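To make the trace-quality figure concrete, the following minimal sketch shows one way an F1 score between an estimated and an actual execution trace could be computed. It assumes, purely for illustration, that each trace is reduced to a set of executed statement line numbers; the abstract does not specify the trace representation, and the function name `trace_f1` and the sample line numbers are hypothetical.

```python
from typing import Iterable


def trace_f1(estimated: Iterable[int], actual: Iterable[int]) -> float:
    """F1 score between two traces, each treated as a set of line numbers.

    Assumption: order and repetition of statements are ignored, so a
    statement executed many times counts once.
    """
    est, act = set(estimated), set(actual)
    true_positives = len(est & act)
    if true_positives == 0:
        # Covers empty traces and traces with no overlap.
        return 0.0
    precision = true_positives / len(est)  # share of estimated lines that really ran
    recall = true_positives / len(act)     # share of executed lines that were estimated
    return 2 * precision * recall / (precision + recall)


# Hypothetical traces: four of the five lines in each trace are shared.
estimated_trace = [10, 11, 12, 15, 20]
actual_trace = [10, 11, 12, 14, 20]
score = trace_f1(estimated_trace, actual_trace)
print(f"F1 = {score:.2f}")
```

Under this set-based reading, a high F1 means the estimated trace neither omits many executed statements (recall) nor includes many statements that never ran (precision).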