LLMs and LLM-based agents have made impressive progress in code generation, mathematical reasoning, and scientific discovery. However, existing benchmarks primarily measure correctness, overlooking the diversity of methods behind solutions. True innovation depends not only on producing correct answers but also on the originality of the approach. We present InnoGym, the first benchmark and framework designed to systematically evaluate the innovation potential of AI agents. InnoGym introduces two complementary metrics: performance gain, which measures improvement over the best-known solution, and novelty, which captures methodological differences from prior approaches. The benchmark comprises 18 carefully curated tasks from real-world engineering and scientific domains, each standardized through resource filtering, evaluator validation, and solution collection. In addition, we provide iGym, a unified execution environment for reproducible, long-horizon evaluations. Extensive experiments show that while some agents produce novel approaches, their lack of robustness limits performance gains. These results highlight a key gap between creativity and effectiveness, underscoring the need for benchmarks that evaluate both.
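As an illustrative sketch only (not the paper's definition), one plausible way to formalize the performance-gain metric, assuming each task $t$ exposes a scalar score $s_t(\cdot)$ (higher is better) and a best-known reference solution $b_t$, is

$$\mathrm{gain}(a, t) \;=\; \frac{s_t(a) - s_t(b_t)}{\lvert s_t(b_t) \rvert},$$

so that $\mathrm{gain} > 0$ indicates an agent $a$ that improves on the best-known solution; the symbols $s_t$, $b_t$, and the normalization are assumptions introduced here for illustration.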