Maintaining traceability links between software release notes and the corresponding development artifacts, e.g., pull requests (PRs), commits, and issues, is essential for managing technical debt and ensuring maintainability. However, in open-source environments where contributors work remotely and asynchronously, establishing and maintaining these links is error-prone, time-consuming, and frequently overlooked. Our empirical study of GitHub repositories revealed that 47% of release artifacts lacked traceability links and 12% contained broken links. To address this gap, we first analyzed release notes to identify their What, Why, and How information and assessed how this information aligns with PRs, commits, and issues. We curated a benchmark dataset of 3,500 filtered and validated traceability link instances. We then implemented LLM-based approaches to automatically establish traceability links for three pairs: release note contents and PRs, release note contents and commits, and release note contents and issues. When combined with a time proximity feature, the LLM-based approach (e.g., using Gemini 1.5 Pro) achieved a Precision@1 of 0.73 for PR traceability recovery. To evaluate the usability and adoption potential of this approach, we conducted an online survey of 33 open-source practitioners: 16% of respondents rated it as very important for traceability maintenance, and 68% rated it as somewhat important.
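For readers unfamiliar with the reported metric, the following is a minimal sketch of how Precision@1 can be computed for traceability recovery: the fraction of release note entries whose top-ranked candidate artifact is a true link. The function name, the data layout, and the identifiers such as `rn1` and `PR-12` are illustrative assumptions, not the implementation or data used in the study.

```python
from typing import Dict, List, Set


def precision_at_1(ranked: Dict[str, List[str]], gold: Dict[str, Set[str]]) -> float:
    """Fraction of release note entries whose top-ranked candidate is a true link.

    ranked: hypothetical mapping from a release note entry ID to its candidate
            artifacts (e.g., PRs), ordered from most to least likely.
    gold:   hypothetical mapping from the same entry IDs to their true artifacts.
    """
    if not ranked:
        return 0.0
    hits = 0
    for entry_id, candidates in ranked.items():
        # An entry with no candidates counts as a miss.
        if candidates and candidates[0] in gold.get(entry_id, set()):
            hits += 1
    return hits / len(ranked)


# Illustrative data only: four release note entries with ranked PR candidates.
example_ranked = {
    "rn1": ["PR-12", "PR-9"],
    "rn2": ["PR-3"],
    "rn3": ["PR-7", "PR-8"],
    "rn4": [],
}
example_gold = {
    "rn1": {"PR-12"},
    "rn2": {"PR-4"},
    "rn3": {"PR-7"},
    "rn4": {"PR-1"},
}

score = precision_at_1(example_ranked, example_gold)
print(score)
```

In this toy example two of the four entries have a correct top-ranked candidate. Under this definition, a Precision@1 of 0.73 means that the top-ranked PR was a correct link for 73% of the evaluated release note contents.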