The Riddle of Reflection: Evaluating Reasoning and Self-Awareness in Multilingual LLMs using Indian Riddles

The extent to which large language models (LLMs) can perform culturally grounded reasoning in non-English languages remains underexplored. This paper examines the reasoning and self-assessment abilities of LLMs across seven major Indian languages: Bengali, Gujarati, Hindi, Kannada, Malayalam, Tamil, and Telugu. We introduce a multilingual riddle dataset that combines traditional riddles with context-reconstructed variants, and we evaluate five LLMs (Gemini 2.5 Pro, Gemini 2.5 Flash, Mistral-Saba, LLaMA 4 Scout, and LLaMA 4 Maverick) under seven prompting strategies. In the first stage, we assess riddle-solving performance and find that while Gemini 2.5 Pro performs best overall, few-shot methods yield only marginal gains, and accuracy varies notably across languages. In the second stage, we conduct a self-evaluation experiment to measure reasoning consistency. The results reveal a key finding: a model's initial accuracy is inversely correlated with its ability to identify its own mistakes. Top-performing models such as Gemini 2.5 Pro are overconfident (4.34% True Negative Rate), whereas lower-performing models such as LLaMA 4 Scout are substantially more self-aware (42.09% True Negative Rate). These results point to clear gaps in multilingual reasoning and highlight the need for models that not only reason effectively but also recognize their own limitations.
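The abstract reports a True Negative Rate (TNR) without defining it. The sketch below assumes the usual confusion-matrix reading applied to self-evaluation: of the riddles a model originally answered incorrectly, the fraction it later flags as incorrect when reviewing its own answer. The `SelfEvalRecord` type, its field names, and the sample records are hypothetical illustrations, not the paper's data or code.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SelfEvalRecord:
    # Hypothetical record layout; the paper's actual schema is not given.
    answer_correct: bool  # was the model's original riddle answer right?
    judged_correct: bool  # on review, did the model call its own answer right?


def true_negative_rate(records: List[SelfEvalRecord]) -> Optional[float]:
    """Share of originally wrong answers that the model itself flagged as wrong.

    Returns None when there are no wrong answers, since the rate is undefined.
    """
    wrong = [r for r in records if not r.answer_correct]
    if not wrong:
        return None
    flagged = sum(1 for r in wrong if not r.judged_correct)
    return flagged / len(wrong)


# Made-up example: four wrong answers, only one of which the model catches.
records = [
    SelfEvalRecord(answer_correct=True, judged_correct=True),
    SelfEvalRecord(answer_correct=False, judged_correct=True),
    SelfEvalRecord(answer_correct=False, judged_correct=False),
    SelfEvalRecord(answer_correct=False, judged_correct=True),
    SelfEvalRecord(answer_correct=False, judged_correct=True),
]

tnr = true_negative_rate(records)
print(f"TNR = {tnr:.2%}")  # 1 of 4 wrong answers flagged
```

Under this reading, a low TNR means the model rarely catches its own mistakes, which is the sense in which the abstract calls the strongest models overconfident.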

