From Polyester Girlfriends to Blind Mice: Creating the First Pragmatics Understanding Benchmarks for Slovene

Large language models are demonstrating increasing capabilities, excelling at benchmarks once considered very difficult. As their capabilities grow, there is a need for more challenging evaluations that go beyond surface-level linguistic competence. Language competence involves not only syntax and semantics but also pragmatics, i.e., understanding situational meaning as shaped by context as well as by linguistic and cultural norms. To contribute to this line of research, we introduce SloPragEval and SloPragMega, the first pragmatics understanding benchmarks for Slovene, which together contain 405 multiple-choice questions. We discuss the difficulties of translation, describe the campaign to establish a human baseline, and report pilot evaluations with LLMs. Our results indicate that current models have greatly improved in understanding nuanced language but may still fail to infer implied speaker meaning in non-literal utterances, especially those that are culture-specific. We also observe a significant gap between proprietary and open-source models. Finally, we argue that benchmarks targeting nuanced language understanding and knowledge of the target culture must be designed with care, preferably constructed from native data, and validated with human responses.
