Large Code Models (LCMs) show promise for code intelligence, but their effectiveness depends heavily on prompt quality. Current prompt design is mostly manual, which is time-consuming and tightly coupled to specific LCMs and tasks. Although automated prompt generation (APG) has been studied in NLP, it remains underexplored for code intelligence. This creates a gap, since automating the prompting process is essential for developers facing diverse tasks and black-box LCMs. To close this gap, we empirically investigate two key components of APG: Instruction Generation (IG) and Multi-Step Reasoning (MSR). IG supplies a task-related description that instructs the LCM, while MSR guides it to produce intermediate logical steps before the final answer. We evaluate widely used APG methods for each component on four open-source LCMs and three code intelligence tasks: code translation (PL-PL), code summarization (PL-NL), and API recommendation (NL-PL). Experimental results show that both IG and MSR substantially improve performance over basic prompts. Based on these findings, we propose a novel APG approach that combines the best-performing methods of the two components. Experiments show that our approach achieves average improvements of 28.38% in CodeBLEU (code translation), 58.11% in ROUGE-L (code summarization), and 84.53% in SuccessRate@1 (API recommendation) over basic prompts. To validate its effectiveness in an industrial setting, we evaluate our approach on WeChat-Bench, a proprietary dataset, achieving an average MRR improvement of 148.89% for API recommendation.
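To illustrate the distinction between the two components, a basic prompt and an IG + MSR prompt can be sketched as simple templates. This is a minimal sketch: the function names and template wording below are illustrative assumptions, not the paper's actual prompts.

```python
# Minimal sketch contrasting a basic prompt with an IG + MSR prompt.
# Template wording is illustrative, not the paper's exact prompts.

def basic_prompt(code: str) -> str:
    # A bare request with no task description or reasoning guidance.
    return f"Translate this code to Python:\n{code}"

def apg_prompt(code: str) -> str:
    # Instruction Generation (IG): a task-related description that tells
    # the model what the task is and what the output should satisfy.
    instruction = (
        "You are an expert software engineer. Translate the given Java "
        "snippet into idiomatic Python, preserving behavior and naming."
    )
    # Multi-Step Reasoning (MSR): guide the model to produce logical
    # intermediate steps before emitting the final answer.
    reasoning = (
        "First, explain the snippet's control flow and data structures "
        "step by step. Then output the final Python translation."
    )
    return f"{instruction}\n{reasoning}\n\nCode:\n{code}"

print(apg_prompt("int add(int a, int b) { return a + b; }"))
```

In this sketch the IG part fixes the task framing while the MSR part elicits intermediate reasoning; the paper's approach selects the best-performing method for each component rather than using fixed hand-written templates like these.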