Most current large language models (LLMs) support a wide variety of languages in addition to English, including high-resource languages (e.g., German, Chinese, French) as well as low-resource ones (e.g., Swahili, Telugu). They have also shown impressive capabilities in different domains, such as coding, science, and math. In this short paper, taking math as an example domain, we study the performance of different LLMs across languages. Experimental results show that there exists a non-negligible and consistent gap in the performance of the models across languages. Interestingly, and somewhat against expectations, the gap exists for both high- and low-resource languages. We hope that these results influence further research into cross-lingual capability generalization for next-generation LLMs. If it weren't for the fact that they are false! By analyzing one of the standard multilingual math benchmarks (MGSM), we determine that several translation errors are present in the data. Furthermore, the lack of standardized answer extraction from LLM outputs distorts the final results. We propose a method for automatic quality assurance to address the first issue at scale, and give recommendations to address the second. Combining these two approaches, we show that the aforementioned language gap mostly disappears, leading to completely different conclusions. We additionally release the corrected dataset to the community.
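To make the answer-extraction issue concrete, the sketch below shows one way such extraction could be standardized across languages; it is an illustrative assumption of ours, not the exact procedure used in this work. The helper name `extract_final_number` is hypothetical.

```python
import re
import unicodedata


def extract_final_number(output: str) -> str | None:
    """Illustrative sketch (not the paper's method): pull the last numeric
    value from a model's free-form answer, normalizing across scripts.

    Non-Latin digits (e.g., Telugu or Bengali numerals) are mapped to ASCII
    and thousands separators are dropped, so that the same numeric answer is
    extracted regardless of the language of the model output.
    """
    # Map every Unicode decimal digit to its ASCII value (e.g., "౩" -> "3").
    normalized = "".join(
        str(unicodedata.digit(ch)) if ch.isdigit() else ch for ch in output
    )
    # Drop "," used as a thousands separator (assumes "." is the decimal mark).
    normalized = normalized.replace(",", "")
    # Take the last number mentioned as the model's final answer.
    matches = re.findall(r"-?\d+(?:\.\d+)?", normalized)
    return matches[-1] if matches else None


# Both outputs yield "36", even though one uses Telugu digits.
print(extract_final_number("The answer is 36."))
print(extract_final_number("సమాధానం ౩౬."))
```

Without a shared rule of this kind, per-language extraction quirks can be mistaken for genuine performance differences.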