The rapid integration of Large Language Models (LLMs) into software engineering (SE) has revolutionized tasks such as code generation, producing a massive volume of software artifacts. This surge has exposed a critical bottleneck: the lack of scalable, reliable methods for evaluating these outputs. Human evaluation is costly and time-consuming, while traditional automated metrics such as BLEU fail to capture nuanced quality aspects. In response, the LLM-as-a-Judge paradigm, which uses LLMs themselves as automated evaluators, has emerged. This approach leverages the advanced reasoning capabilities of LLMs, offering a path toward human-like nuance at automated scale. However, LLM-as-a-Judge research in SE is still in its early stages. This forward-looking SE 2030 paper aims to steer the community toward advancing LLM-as-a-Judge for evaluating LLM-generated software artifacts. We provide a literature review of existing SE studies, analyze their limitations, identify key research gaps, and outline a detailed roadmap. We envision that by 2030 these frameworks will serve as reliable, robust, and scalable human surrogates capable of consistent, multi-faceted artifact evaluation. Our work aims to foster research on and adoption of LLM-as-a-Judge frameworks, ultimately improving the scalability of software artifact evaluation.