Pre-trained language models (PLMs) have advanced rapidly and shown promising results on a variety of code-related tasks. However, detecting real-world vulnerabilities remains a critical challenge for these models. While existing empirical studies evaluate PLMs for vulnerability detection (VD), they suffer from data leakage, limited scope, and superficial analysis, which undermine the accuracy and comprehensiveness of their evaluations. This paper begins by revisiting the common issues in existing research on PLMs for VD through the lens of the evaluation pipeline. It then presents a rigorous and extensive evaluation of 18 PLMs on high-quality datasets featuring accurate labels, diverse vulnerability types, and code drawn from a variety of projects. Specifically, we compare the performance of PLMs under both fine-tuning and prompt engineering, assess their effectiveness and generalizability across various training and testing settings, and analyze their robustness to a series of perturbations. Our findings reveal that PLMs whose pre-training tasks are designed to capture the syntactic and semantic patterns of code outperform both general-purpose PLMs and those merely pre-trained or fine-tuned on large code corpora. However, these models face notable challenges in real-world scenarios, such as detecting vulnerabilities with complex dependencies, handling perturbations introduced by code normalization and abstraction, and recognizing semantic-preserving transformations of vulnerable code. Moreover, truncation caused by the limited context windows of PLMs can lead to a non-negligible number of labeling errors, an issue overlooked by previous work. This study underscores the importance of thoroughly evaluating model performance in practical scenarios and outlines future directions for enhancing the effectiveness of PLMs in realistic VD applications.