NL2Repo-Bench: Towards Long-Horizon Repository Generation Evaluation of Coding Agents

Jingzhe Ding, Shengda Long, Changxin Pu, Huan Zhou, Hongwan Gao, Xiang Gao, Chao He, Yue Hou, Fei Hu, Zhaojian Li, Weiran Shi, Zaiyuan Wang, Daoguang Zan, Chenchen Zhang, Xiaoxu Zhang, Qizhi Chen, Xianfu Cheng, Bo Deng, Qingshui Gu, Kai Hua, Juntao Lin, Pai Liu, Mingchen Li, Xuanguang Pan, Zifan Peng, Yujia Qin, Yong Shan, Zhewen Tan, Weihao Xie, Zihan Wang, Yishuo Yuan, Jiayu Zhang, Enduo Zhao, Yunfei Zhao, He Zhu, Chenyang Zou, Ming Ding, Jianpeng Jiao, Jiaheng Liu, Minghao Liu, Qian Liu, Chongyao Tao, Jian Yang, Tong Yang, Zhaoxiang Zhang, Xinjie Chen, Wenhao Huang, Ge Zhang

Recent advances in coding agents suggest rapid progress toward autonomous software development, yet existing benchmarks fail to rigorously evaluate the long-horizon capabilities required to build complete software systems. Most prior evaluations focus on localized code generation, scaffolded completion, or short-term repair tasks, leaving open the question of whether agents can sustain coherent reasoning, planning, and execution over the extended horizons demanded by real-world repository construction. To address this gap, we present NL2Repo-Bench, a benchmark explicitly designed to evaluate the long-horizon repository generation ability of coding agents. Given only a single natural-language requirements document and an empty workspace, agents must autonomously design the architecture, manage dependencies, implement multi-module logic, and produce a fully installable Python library. Our experiments across state-of-the-art open- and closed-source models reveal that long-horizon repository generation remains largely unsolved: even the strongest agents achieve average test pass rates below 40% and rarely complete an entire repository correctly. Detailed analysis uncovers fundamental long-horizon failure modes, including premature termination, loss of global coherence, fragile cross-file dependencies, and inadequate planning over hundreds of interaction steps. NL2Repo-Bench establishes a rigorous, verifiable testbed for measuring sustained agentic competence and highlights long-horizon reasoning as a central bottleneck for the next generation of autonomous coding agents.
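
The abstract reports two different outcomes: an average test pass rate, and how rarely an entire repository is completed correctly. The sketch below illustrates how these two quantities can diverge. It is only an illustration under stated assumptions: the `RepoResult` structure, the `summarize` helper, the per-repository averaging, and the example numbers are all hypothetical and are not taken from the benchmark's actual harness or results.

```python
from dataclasses import dataclass


@dataclass
class RepoResult:
    """Test outcome for one generated repository (hypothetical structure)."""
    name: str
    passed: int  # number of test cases that passed
    total: int   # number of test cases that were run

    @property
    def pass_rate(self) -> float:
        # Treat a repository with no runnable tests as a 0% pass rate.
        return self.passed / self.total if self.total else 0.0


def summarize(results: list[RepoResult]) -> dict[str, float]:
    """Return the mean per-repository pass rate and the fraction of
    repositories in which every test passed (assumed aggregation)."""
    if not results:
        return {"avg_pass_rate": 0.0, "full_pass_fraction": 0.0}
    avg = sum(r.pass_rate for r in results) / len(results)
    full = sum(1 for r in results if r.total > 0 and r.passed == r.total)
    return {"avg_pass_rate": avg, "full_pass_fraction": full / len(results)}


# Made-up numbers for illustration only; not results from the paper.
example = [
    RepoResult("repo_a", passed=10, total=10),
    RepoResult("repo_b", passed=3, total=12),
    RepoResult("repo_c", passed=0, total=8),
]
summary = summarize(example)
```

With these made-up inputs, partial credit on `repo_b` contributes to the average pass rate but not to the fraction of fully passing repositories, which is why the two figures are worth reporting separately.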
