The significant increase in software production, driven by the acceleration of development cycles over the past two decades, has led to a steady rise in software vulnerabilities, as shown by the statistics published yearly by the CVE program. Automating the source code vulnerability detection (CVD) process has thus become essential, and several methods have been proposed, ranging from well-established program analysis techniques to more recent AI-based approaches. Our research investigates Large Language Models (LLMs), considered among the most capable AI models to date, for the CVD task. The objective is to study their performance and to apply different state-of-the-art techniques to enhance their effectiveness on this task. We explore various fine-tuning and prompt engineering settings. In particular, we propose a novel approach for fine-tuning LLMs, which we call Double Fine-tuning, and also evaluate the understudied Test-Time Fine-tuning approach. We leverage the recently open-sourced Llama-3.1 8B model, with source code samples extracted from the BigVul and PrimeVul datasets. Our conclusions highlight the importance of fine-tuning for resolving the task, the performance of Double Fine-tuning, and the potential of Llama models for CVD. Although prompting alone proved ineffective, Retrieval-Augmented Generation (RAG) performed relatively well as an example selection technique. Overall, some of our research questions have been answered, while others remain open, leaving many perspectives for future work. The code repository is available at: https://github.com/DynaSoumhaneOuchebara/Llama-based-vulnerability-detection.