In recent years, large language models (LLMs) have demonstrated strong performance on multilingual tasks. Given their wide range of applications, cross-cultural understanding is a crucial competency. However, existing benchmarks for evaluating whether LLMs genuinely possess this capability suffer from three key limitations: a lack of contextual scenarios, insufficient cross-cultural concept mapping, and limited coverage of deep cultural reasoning. To address these gaps, we propose SAGE, a scenario-based benchmark built via cross-cultural core concept alignment and generative task design, to evaluate LLMs' cross-cultural understanding and reasoning. Grounded in cultural theory, we categorize cross-cultural capabilities into nine dimensions. Using this framework, we curated 210 core concepts and, following established item design principles, constructed 4,530 test items across 15 specific real-world scenarios organized under four broader categories of cross-cultural situations. The SAGE dataset supports continuous expansion, and experiments confirm its transferability to other languages. It reveals model weaknesses across both dimensions and scenarios, exposing systematic limitations in cross-cultural reasoning. While progress has been made, LLMs remain some distance from a truly nuanced cross-cultural understanding. In compliance with the anonymity policy, we include the data and code in the supplementary materials; in future versions, we will make them publicly available online.