Automating Android Build Repair: Bridging the Reasoning-Execution Gap in LLM Agents with Domain-Specific Tools

Android is the largest mobile platform, yet automatically building applications remains a practical challenge. While Large Language Models (LLMs) show promise for code repair, their use for fixing Android build errors remains underexplored. To address this gap, we first introduce AndroidBuildBench, a benchmark of 1,019 build failures curated from the commit histories of 43 open-source Android projects. Each problem is paired with a verified solution from a subsequent commit, ensuring that fixes are feasible. Second, we propose GradleFixer, an LLM agent with domain-specific tools for inspecting and manipulating the Gradle build environment. GradleFixer achieves a resolve rate of 81.4% (pass@1), significantly outperforming a state-of-the-art coding agent that relies on a general-purpose shell. GradleFixer's success suggests that while LLMs possess the high-level knowledge to solve these failures, they struggle to translate this knowledge into effective low-level actions using a general-purpose shell. We demonstrate the effectiveness of a strategy we term Tool Bridging, which replaces general-purpose shell commands with domain-aware abstractions. We hypothesize this approach works through two mechanisms: 1) it provides tools in an API-like format that LLMs use more reliably, and 2) it constrains the action space to relevant operations. This approach bridges the gap between the model's high-level reasoning and effective low-level execution.
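
The two hypothesized mechanisms behind Tool Bridging can be illustrated with a minimal sketch. This is not GradleFixer's implementation: the tool names (get_property, set_property), the BuildEnv class, and the call format are all hypothetical, and the build environment is reduced to an in-memory copy of a gradle.properties file. The sketch only shows the general shape of the idea: the agent emits structured, API-like calls, and a fixed registry limits what those calls can do.

```python
from dataclasses import dataclass
from typing import Callable, Dict


def parse_properties(text: str) -> Dict[str, str]:
    """Parse simple key=value lines, skipping blanks and # comments."""
    props: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        props[key.strip()] = value.strip()
    return props


@dataclass
class BuildEnv:
    """Hypothetical in-memory stand-in for a project's gradle.properties."""
    properties: Dict[str, str]


def get_property(env: BuildEnv, key: str) -> str:
    """Hypothetical inspection tool: read one property."""
    return env.properties.get(key, "<unset>")


def set_property(env: BuildEnv, key: str, value: str) -> str:
    """Hypothetical manipulation tool: write one property."""
    env.properties[key] = value
    return f"{key}={value}"


# Mechanism 2: the registry is the entire action space. There is no
# free-form shell, so irrelevant or destructive commands cannot be issued.
TOOLS: Dict[str, Callable[..., str]] = {
    "get_property": get_property,
    "set_property": set_property,
}


def dispatch(env: BuildEnv, call: dict) -> str:
    """Mechanism 1: run one API-like call {"tool": name, "args": {...}}."""
    name = call.get("tool")
    if name not in TOOLS:
        return f"error: unknown tool {name!r}; available: {sorted(TOOLS)}"
    try:
        return TOOLS[name](env, **call.get("args", {}))
    except TypeError as exc:
        # Wrong or missing arguments come back as a readable message.
        return f"error: bad arguments for {name}: {exc}"
```

Under these assumptions, a call such as {"tool": "set_property", "args": {"key": "android.useAndroidX", "value": "true"}} updates the property, while a request for an unregistered tool such as a raw shell command is rejected with a message listing the available tools. Returning errors as strings rather than raising lets the agent read the failure and retry with a corrected call.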
