The extent to which large language models (LLMs) can perform culturally grounded reasoning across non-English languages remains underexplored. This paper examines the reasoning and self-assessment abilities of LLMs across seven major Indian languages: Bengali, Gujarati, Hindi, Kannada, Malayalam, Tamil, and Telugu. We introduce a multilingual riddle dataset combining traditional riddles with context-reconstructed variants, and we evaluate five LLMs (Gemini 2.5 Pro, Gemini 2.5 Flash, Mistral-Saba, LLaMA 4 Scout, and LLaMA 4 Maverick) under seven prompting strategies. In the first stage, we assess riddle-solving performance and find that while Gemini 2.5 Pro performs best overall, few-shot methods yield only marginal gains, and accuracy varies notably across languages. In the second stage, we conduct a self-evaluation experiment to measure reasoning consistency. The results reveal a key finding: a model's initial accuracy is inversely correlated with its ability to identify its own mistakes. Top-performing models such as Gemini 2.5 Pro are overconfident (4.34% True Negative Rate), whereas lower-performing models like LLaMA 4 Scout are substantially more self-aware (42.09% True Negative Rate). These results point to clear gaps in multilingual reasoning and highlight the need for models that not only reason effectively but also recognize their own limitations.
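For clarity on the self-assessment metric: treating a model's initially incorrect answers as the negative class, the True Negative Rate cited above is the share of its own errors a model correctly flags as wrong. A conventional confusion-matrix formulation (standard notation, not quoted from the paper) is

\[ \mathrm{TNR} = \frac{\mathrm{TN}}{\mathrm{TN} + \mathrm{FP}}, \]

where TN counts errors the model judges incorrect and FP counts errors it endorses as correct. On this reading, Gemini 2.5 Pro flags only about 1 in 23 of its own mistakes, while LLaMA 4 Scout catches roughly 2 in 5.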