CFCEval: Evaluating Security Aspects in Code Generated by Large Language Models

Code-focused Large Language Models (LLMs), such as Codex and StarCoder, have demonstrated remarkable capabilities in enhancing developer productivity through context-aware code generation. However, evaluating the quality and security of LLM-generated code remains a significant challenge. Existing evaluation protocols for Code LLMs lack both methodological rigor and comprehensive scope. A key limitation is dataset bias, which arises from unintentional overlap between training and testing data. Furthermore, while CodeBLEU, a BLEU-based metric, is widely used to assess code similarity, it suffers from critical shortcomings, including imprecise tokenization, structural limitations, and low reference diversity. To address these challenges, we introduce CFCEval, a novel framework for evaluating the quality and security of code generated by LLMs. CFCEval mitigates dataset bias by introducing a new benchmark, MLVBench, and incorporates ELRM, a new metric designed to assess the relevance between reference code and generated code. CFCEval evaluates generated code along four dimensions: programming quality, vulnerability-fixing capability, post-transformation fixing capability, and relevance. Our experiments show that CFCEval captures both the quality and security aspects of generated code more effectively, and that ELRM aligns more closely with human judgments than CodeBLEU, paving the way for future advances in Code LLM evaluation.
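To make the "imprecise tokenization" concern concrete, the following is a minimal sketch of how a BLEU-style clipped n-gram precision can swing from 0 to 1 on two semantically identical snippets depending only on how the code is split into tokens. It is an illustration under stated assumptions, not the CodeBLEU or ELRM implementation: the helper names `whitespace_tokens`, `lexer_like_tokens`, and `ngram_precision`, the regular expression, and the example snippets are all hypothetical.

```python
import re
from collections import Counter


def whitespace_tokens(code):
    # Naive tokenization: split on whitespace only (assumed for illustration).
    return code.split()


def lexer_like_tokens(code):
    # Rough code-aware tokenization: identifiers, integers, or single symbols.
    # This regex is a simplification, not a real language lexer.
    return re.findall(r"[A-Za-z_]\w*|\d+|\S", code)


def ngram_precision(reference, candidate, n):
    # Clipped n-gram precision in the style of BLEU: each candidate n-gram
    # counts at most as often as it appears in the reference.
    ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    cand = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
    total = sum(cand.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref[gram]) for gram, count in cand.items())
    return matched / total


# Two snippets that differ only in spacing (hypothetical example).
reference = "strncpy(dst, src, n);"
candidate = "strncpy(dst,src,n);"

naive = ngram_precision(whitespace_tokens(reference), whitespace_tokens(candidate), 1)
aware = ngram_precision(lexer_like_tokens(reference), lexer_like_tokens(candidate), 1)
print(naive, aware)
```

With whitespace splitting, the candidate collapses into a single token that never appears in the reference, so the unigram precision is 0.0. With the lexer-like split, both snippets yield the same nine tokens and the precision is 1.0. A similarity metric built on the first tokenization would penalize a formatting difference that has no bearing on quality or security.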
