Automatically Identifying Solution-Related Content in Issue Report Discussions with Language Models

During issue resolution, software developers rely on issue reports to discuss solutions for defects, feature requests, and other changes. These discussions contain proposed solutions, ranging from design changes to code implementations, as well as their evaluations. Locating solution-related content is essential for investigating reopened issues, addressing regressions, reusing solutions, and understanding the rationale behind code changes. Manually reading long discussions to identify such content can be difficult and time-consuming. This paper automates solution identification using language models as supervised classifiers. We investigate three applications (embeddings, prompting, and fine-tuning) across three classifier types: traditional ML models (MLMs), pre-trained language models (PLMs), and large language models (LLMs). Using 356 Mozilla Firefox issues, we created a dataset to train and evaluate six MLMs, four PLMs, and two LLMs across 68 configurations. Results show that MLMs using LLM embeddings outperform those using TF-IDF features, that prompting underperforms, and that fine-tuned LLMs achieve the highest performance, with LLAMAft reaching an F1 score of 0.716. Ensembles of the best models further improve results (0.737 F1). Misclassifications often arise from misleading clues or missing context, highlighting the need for context-aware classifiers. Models trained on Mozilla data transfer to other projects, and adding a small amount of project-specific data further improves results. This work supports software maintenance, issue understanding, and solution reuse.
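To make the evaluation setup concrete, the following minimal sketch shows how binary predictions from several classifiers could be combined into an ensemble and scored with F1. The abstract does not state how the ensembles are built, so majority voting is an assumption here, and the function names (`majority_vote`, `f1_score`) and the toy labels are hypothetical illustrations rather than the paper's implementation or data.

```python
def majority_vote(predictions_per_model):
    """Combine binary labels (1 = solution-related, 0 = not) by majority vote.

    predictions_per_model: list of equal-length label lists, one per model.
    Assumption: simple majority voting; ties resolve to 0.
    """
    ensemble = []
    for votes in zip(*predictions_per_model):
        # A comment is labeled 1 only if more than half of the models say 1.
        ensemble.append(1 if 2 * sum(votes) > len(votes) else 0)
    return ensemble


def f1_score(gold, predicted, positive=1):
    """F1 for the positive class: harmonic mean of precision and recall."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    if tp == 0:
        return 0.0  # no true positives: precision or recall is zero/undefined
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Toy labels for eight issue comments (made up for illustration only).
    gold = [1, 0, 1, 1, 0, 0, 1, 0]
    model_a = [1, 0, 0, 1, 0, 1, 1, 0]
    model_b = [1, 1, 1, 0, 0, 0, 1, 0]
    model_c = [0, 0, 1, 1, 0, 0, 1, 1]

    ensemble = majority_vote([model_a, model_b, model_c])
    for name, preds in [("A", model_a), ("B", model_b),
                        ("C", model_c), ("ensemble", ensemble)]:
        print(name, round(f1_score(gold, preds), 3))
```

Majority voting is only one option. Averaging predicted probabilities or stacking a meta-classifier on top of the base models are common alternatives, and which of these the reported ensembles use is not specified in the abstract.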
