The rapid integration of Large Language Models (LLMs) into software engineering (SE) has revolutionized tasks such as code generation, producing a massive volume of software artifacts. This surge has exposed a critical bottleneck: the lack of scalable, reliable methods for evaluating these outputs. Human evaluation is costly and time-consuming, while traditional automated metrics such as BLEU fail to capture nuanced quality aspects. In response, the LLM-as-a-Judge paradigm, which uses LLMs themselves as automated evaluators, has emerged. This approach leverages the advanced reasoning capabilities of LLMs, offering a path toward human-like nuance at automated scale. However, LLM-as-a-Judge research in SE is still in its early stages. This forward-looking SE 2030 paper aims to steer the community toward advancing LLM-as-a-Judge for evaluating LLM-generated software artifacts. We provide a literature review of existing SE studies, analyze their limitations, identify key research gaps, and outline a detailed roadmap. We envision that by 2030 these frameworks will serve as reliable, robust, and scalable human surrogates capable of consistent, multi-faceted artifact evaluation. Our work aims to foster research on and adoption of LLM-as-a-Judge frameworks, ultimately improving the scalability of software artifact evaluation.