Generating Software Architecture Description from Source Code using Reverse Engineering and Large Language Model

Software Architecture Descriptions (SADs) are essential for managing the inherent complexity of modern software systems. They enable high-level architectural reasoning, guide design decisions, and facilitate effective communication among diverse stakeholders. In practice, however, SADs are often missing, outdated, or poorly aligned with the system's actual implementation. Consequently, developers are compelled to derive architectural insights directly from source code, a time-intensive process that increases cognitive load, slows the onboarding of new developers, and contributes to the gradual degradation of clarity over the system's lifetime. To address these issues, we propose a semi-automated approach that generates SADs from source code by integrating reverse engineering (RE) techniques with a Large Language Model (LLM). Our approach recovers both static and behavioral architectural views in three steps: it extracts a comprehensive component diagram, filters the architecturally significant elements (core components) via prompt engineering, and generates state machine diagrams that model component behavior from the underlying code logic using few-shot prompting. The resulting views offer a scalable and maintainable alternative to traditional manual architectural documentation. The methodology, demonstrated on C++ examples, highlights the capability of LLMs to 1) abstract the component diagram, thereby reducing the reliance on human experts, and 2) accurately represent complex software behaviors, especially when enriched with domain-specific knowledge through few-shot prompting. These findings suggest a viable path toward significantly reducing manual effort while enhancing system understanding and long-term maintainability.
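To make the described pipeline more concrete, the following is a minimal sketch of two of its ingredients, not the authors' implementation. It assumes that a simple scan of project-local `#include` directives can stand in for a real reverse engineering tool, and that few-shot examples are given as (code, diagram) string pairs. The function names `extract_dependencies` and `build_few_shot_prompt`, the prompt wording, and the diagram notation are all hypothetical; no LLM is actually called.

```python
import re

# Matches project-local includes such as: #include "motor.h"
# (system includes written with <...> are deliberately ignored).
INCLUDE_RE = re.compile(r'^\s*#\s*include\s*"([^"]+)"', re.MULTILINE)


def extract_dependencies(sources):
    """Map each source file name to the sorted local headers it includes.

    `sources` is a dict of {file name: file contents}. This is a crude
    stand-in for the component extraction step, used only for illustration.
    """
    return {
        name: sorted(set(INCLUDE_RE.findall(text)))
        for name, text in sources.items()
    }


def build_few_shot_prompt(examples, code):
    """Assemble a few-shot prompt from worked (code, diagram) pairs.

    The worked examples come first, followed by the target code with an
    open "Diagram:" slot for the model to complete. The instruction text
    is a hypothetical placeholder.
    """
    parts = ["Derive a state machine diagram from the C++ code."]
    for example_code, example_diagram in examples:
        parts.append(f"Code:\n{example_code}\nDiagram:\n{example_diagram}")
    parts.append(f"Code:\n{code}\nDiagram:")
    return "\n\n".join(parts)
```

In such a setup, the dependency map could be serialized into a prompt asking the model to select core components, while `build_few_shot_prompt` would be invoked once per selected component with domain-specific examples supplied by an expert.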
