Large language models show promise for vulnerability discovery, yet prevailing methods inspect code in isolation, struggle with long contexts, and focus on coarse function- or file-level detection. This offers limited actionable guidance to engineers, who need precise line-level localization and targeted patches in real-world software development. We present T2L-Agent (Trace-to-Line Agent), a project-level, end-to-end framework that plans its own analysis and progressively narrows scope from modules to exact vulnerable lines. T2L-Agent couples multi-round feedback with an Agentic Trace Analyzer (ATA) that fuses run-time evidence such as crash points, stack traces, and coverage deltas with AST-based code chunking, enabling iterative refinement beyond single-pass predictions and translating symptoms into actionable, line-level diagnoses. To benchmark line-level vulnerability discovery, we introduce T2L-ARVO, a diverse, expert-verified 50-case benchmark spanning five crash families and real-world projects. T2L-ARVO is specifically designed to support both coarse-grained detection and fine-grained localization, enabling rigorous evaluation of systems that aim to move beyond file-level predictions. On T2L-ARVO, T2L-Agent achieves up to 58.0% detection and 54.8% line-level localization, substantially outperforming baselines. Together, the framework and benchmark push LLM-based vulnerability detection from coarse identification toward deployable, robust, precision diagnostics that reduce noise and accelerate patching in open-source software workflows.
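To make the idea of fusing run-time evidence with AST-based code chunking concrete, here is a minimal sketch. It is not the T2L-Agent implementation: it assumes Python source and the standard `ast` module for chunking, whereas real targets are often C/C++ projects. The names `Chunk`, `chunk_functions`, `rank_chunks`, and `SAMPLE`, as well as the `1 / (depth + 1)` frame weighting, are hypothetical choices made for illustration only.

```python
import ast
from dataclasses import dataclass


@dataclass
class Chunk:
    name: str
    start: int  # first line of the function (1-indexed)
    end: int    # last line of the function (inclusive)


def chunk_functions(source: str) -> list[Chunk]:
    """Split source into one chunk per function definition using the AST."""
    tree = ast.parse(source)
    chunks = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            chunks.append(Chunk(node.name, node.lineno, node.end_lineno))
    return sorted(chunks, key=lambda c: c.start)


def rank_chunks(chunks: list[Chunk], frame_lines: list[int]) -> list[tuple[str, float]]:
    """Score each chunk by the stack frames whose line falls inside it.

    frame_lines holds line numbers from a stack trace, innermost frame first.
    The frame at depth i contributes 1 / (i + 1); this weighting is an
    assumed heuristic, not one taken from the paper.
    """
    scores = {c.name: 0.0 for c in chunks}
    for depth, line in enumerate(frame_lines):
        for c in chunks:
            if c.start <= line <= c.end:
                scores[c.name] += 1.0 / (depth + 1)
    # Highest score first; ties keep source order because sorted() is stable.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical program under analysis (three top-level functions).
SAMPLE = '''\
def parse_header(buf):
    size = buf[0]
    return buf[1:1 + size]

def read_record(buf):
    header = parse_header(buf)
    return header[4]

def main(buf):
    return read_record(buf)
'''
```

Calling `rank_chunks(chunk_functions(SAMPLE), [7, 10])` simulates a crash at line 7 reached from line 10 and places `read_record` ahead of `main`, with `parse_header` last. A fuller system would presumably fold in coverage deltas and repeat the narrowing over several rounds, as the abstract describes; this sketch covers only a single ranking step.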