Although Large Language Models (LLMs) show promise for automated code generation, they often produce insecure code that threatens software security. Current approaches to improving secure code generation (e.g., SafeCoder) are limited by small, imbalanced instruction-tuning datasets. In this work, we present Secure-Instruct, a novel pipeline that automatically synthesizes high-quality vulnerable and secure code examples and instruction-tunes LLMs to align task descriptions with secure code generation. We evaluate Secure-Instruct on four representative LLMs using two security-related benchmarks: our own CWEBench and the existing CWEval. CWEBench comprises 93 scenarios covering 44 CWEs, none of which overlap with Secure-Instruct's synthetic instruction-tuning dataset, while CWEval covers 31 CWEs with 119 manually verified security-critical tasks. We find that Secure-Instruct improves both the security and the functional correctness of generated code. On CWEBench, Secure-Instruct substantially improves secure code generation, yielding a 28.5% average increase in secure ratio over the pre-trained models and outperforming SafeCoder by 12.6%. On CWEval, Secure-Instruct achieves a Func-Sec@1 increase of 157.3% for CodeLlama-7B and 46.4% for Mistral-7B over the pre-trained models, and significantly outperforms SafeCoder.