AraLingBench: A Human-Annotated Benchmark for Evaluating Arabic Linguistic Capabilities of Large Language Models

We present AraLingBench, a fully human-annotated benchmark for evaluating the Arabic linguistic competence of large language models (LLMs). The benchmark comprises 150 expert-designed multiple-choice questions spanning five core categories (grammar, morphology, spelling, reading comprehension, and syntax) that directly assess structural language understanding. An evaluation of 35 Arabic and bilingual LLMs reveals that current models demonstrate strong surface-level proficiency but struggle with deeper grammatical and syntactic reasoning. AraLingBench highlights a persistent gap between high scores on knowledge-based benchmarks and true linguistic mastery, showing that many models succeed through memorization or pattern recognition rather than authentic comprehension. By isolating and measuring fundamental linguistic skills, AraLingBench provides a diagnostic framework for developing Arabic LLMs. The full evaluation code is publicly available on GitHub.

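To make the diagnostic idea concrete, the following is a minimal sketch of how per-category accuracy over multiple-choice items might be computed. It is not the released evaluation code: the function name `per_category_accuracy`, the item keys `"category"` and `"answer"`, and the toy data are all assumptions introduced here for illustration.

```python
from collections import defaultdict


def per_category_accuracy(items, predictions):
    """Return {category: accuracy} for multiple-choice predictions.

    items: list of dicts with hypothetical keys "category" and "answer".
    predictions: list of predicted choice labels, aligned with items.
    """
    if len(items) != len(predictions):
        raise ValueError("items and predictions must have the same length")

    correct = defaultdict(int)
    total = defaultdict(int)
    for item, pred in zip(items, predictions):
        cat = item["category"]
        total[cat] += 1
        correct[cat] += int(pred == item["answer"])

    # Only categories that actually appear in items are reported.
    return {cat: correct[cat] / total[cat] for cat in total}


# Toy data (invented for illustration, not taken from the benchmark).
items = [
    {"category": "grammar", "answer": "A"},
    {"category": "grammar", "answer": "C"},
    {"category": "syntax", "answer": "B"},
    {"category": "spelling", "answer": "D"},
]
predictions = ["A", "B", "B", "D"]

scores = per_category_accuracy(items, predictions)
for category, accuracy in scores.items():
    print(f"{category}: {accuracy:.2f}")
```

Reporting one score per category, rather than a single aggregate, is what lets a weakness in one skill (say, syntax) stay visible even when overall accuracy looks strong.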