Advancements in LLMs have enhanced task automation in software engineering; however, current evaluations primarily focus on natural language tasks and overlook code quality. Most benchmarks prioritize high-level reasoning over executable code and real-world performance, leaving gaps in understanding the true capabilities and risks of these models in production. To address this issue, we propose MERA Code, a new addition to the MERA benchmark family, focused specifically on evaluating the code generation abilities of the latest LLMs in Russian. The benchmark includes 11 evaluation tasks spanning 8 programming languages. Our evaluation methodology features a taxonomy that outlines the practical coding skills models need to complete these tasks. The benchmark comprises an open-source codebase for users to conduct MERA assessments, a scoring system compatible with various programming environments, and a platform featuring a leaderboard and submission system. We evaluate open LLMs and frontier API models, analyzing their limitations on practical coding tasks in non-English languages. We publicly release MERA Code to guide future research, anticipate groundbreaking features in model development, and standardize evaluation procedures.
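To make the kind of scoring the abstract refers to more concrete, the sketch below shows, in rough outline, how execution-based evaluation of model-generated code can work: a candidate snippet is combined with unit-test-style checks and counted as correct only if the resulting program runs cleanly. This is a minimal, hypothetical illustration and not the MERA Code scoring system or API; the function passes_tests, the temporary-file harness, and the sample solution and tests are all invented here for demonstration.

```python
# Hypothetical illustration only -- not the MERA Code API. A toy
# execution-based check: a model-generated snippet is combined with
# unit-test-style assertions and scored as passing if the program exits cleanly.
import os
import subprocess
import sys
import tempfile
import textwrap


def passes_tests(generated_code: str, test_code: str, timeout_s: int = 10) -> bool:
    """Return True if the generated code plus its tests run without error."""
    program = textwrap.dedent(generated_code) + "\n\n" + textwrap.dedent(test_code)
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(program)
        path = f.name
    try:
        result = subprocess.run(
            [sys.executable, path], capture_output=True, timeout=timeout_s
        )
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        return False
    finally:
        os.remove(path)


# A model-generated candidate solution and a minimal test, both invented here.
solution = """
def add(a, b):
    return a + b
"""
tests = """
assert add(2, 3) == 5
assert add(-1, 1) == 0
"""
print(passes_tests(solution, tests))  # True if the snippet satisfies its tests
```

In a full benchmark such as MERA Code, this style of sandboxed execution would be repeated across many tasks and programming languages and aggregated into leaderboard metrics; the specific harness above is an assumption made purely for illustration.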

