Design Rationale (DR) for software architecture decisions refers to the reasoning underlying architectural choices, and it provides valuable insights into the different phases of the architecting process throughout software development. In practice, however, DR is often inadequately documented due to developers' lack of motivation and effort. Recent advances in the text comprehension, reasoning, and generation capabilities of Large Language Models (LLMs) may enable the generation and recovery of DR for architecture decisions. In this study, we evaluated the performance of LLMs in generating DR for architecture decisions. First, we collected 50 Stack Overflow (SO) posts, 25 GitHub issues, and 25 GitHub discussions related to architecture decisions to construct a dataset of 100 architecture-related problems. Then, we selected five LLMs to generate DR for the architecture decisions with three prompting strategies: zero-shot, chain-of-thought (CoT), and LLM-based agents. With the DR provided by human experts as ground truth, the Precision of LLM-generated DR across the three prompting strategies ranges from 0.267 to 0.278, Recall from 0.627 to 0.715, and F1-score from 0.351 to 0.389. Additionally, among the DR arguments not mentioned by human experts, 64.45% to 69.42% are nonetheless helpful, 4.12% to 4.87% are of uncertain correctness, and 1.59% to 3.24% are potentially misleading. To further understand the trustworthiness and applicability of LLM-generated DR in practice, we conducted semi-structured interviews with six practitioners. Based on the experimental and interview results, we discussed the pros and cons of the three prompting strategies, the strengths and limitations of LLM-generated DR, and the implications for the practical use of LLM-generated DR.
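The Precision, Recall, and F1-score reported above compare LLM-generated DR arguments against expert-provided ones. The following is a minimal sketch of how such scores could be computed for a single architecture problem. It assumes each argument has already been normalized to a short label and that matching is exact label equality. Both are simplifications introduced here for illustration: the abstract does not describe the study's actual matching procedure, and the function name `score_arguments` and the example labels are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Scores:
    precision: float
    recall: float
    f1: float


def score_arguments(generated: set[str], expert: set[str]) -> Scores:
    """Score generated DR arguments against expert arguments.

    Assumption: arguments are pre-normalized labels and a match is
    exact equality; the study's real matching may differ.
    """
    matched = len(generated & expert)
    # Precision: share of generated arguments that experts also gave.
    precision = matched / len(generated) if generated else 0.0
    # Recall: share of expert arguments that the LLM recovered.
    recall = matched / len(expert) if expert else 0.0
    # F1: harmonic mean of precision and recall (0 when both are 0).
    total = precision + recall
    f1 = 2 * precision * recall / total if total > 0 else 0.0
    return Scores(precision, recall, f1)


# Hypothetical labels for one architecture problem.
expert = {"scalability", "team familiarity", "operational cost"}
generated = {
    "scalability",
    "operational cost",
    "vendor lock-in",
    "latency",
    "ecosystem maturity",
    "security",
    "testability",
}

scores = score_arguments(generated, expert)
print(scores)
```

In this made-up example the model recovers two of the three expert arguments but also produces five arguments the experts did not mention, so recall is high while precision is low. That is the same qualitative pattern as the reported results, and it is why the separate assessment of the unmentioned arguments (helpful, uncertain, or potentially misleading) matters when interpreting the low Precision.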