Chain-of-Thought (CoT) prompting is widely recognized for enhancing the reasoning capabilities of large language models (LLMs). However, our study reveals a surprising contradiction to this prevailing view within the fundamental domain of pattern-based in-context learning (ICL). Through extensive experiments on 16 state-of-the-art LLMs and nine diverse pattern-based ICL datasets, we demonstrate that CoT and its reasoning variants consistently underperform direct answering across model scales and benchmark complexities. To systematically investigate this unexpected phenomenon, we design experiments that test several candidate explanations. Our analysis uncovers a fundamental hybrid mechanism of explicit-implicit reasoning driving CoT's performance in pattern-based ICL: explicit reasoning falters because LLMs struggle to infer the underlying patterns from demonstrations, while implicit reasoning, though disrupted by the increased contextual distance that CoT rationales introduce, often compensates and delivers correct answers despite flawed rationales. This hybrid mechanism explains CoT's relative underperformance: noise from weak explicit inference undermines the process even as implicit mechanisms partially salvage the outcome. Notably, even long-CoT reasoning models, which excel at abstract and symbolic reasoning, fail to fully overcome these limitations despite their higher computational cost. Our findings challenge assumptions about the universal efficacy of CoT, offer new insights into its limitations, and point future research toward more nuanced and effective reasoning methodologies for LLMs.
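The contrast between direct answering and CoT prompting in pattern-based ICL can be made concrete with a minimal sketch. The toy task, helper names, and prompt wording below are hypothetical illustrations (not drawn from the paper's benchmarks); the hidden pattern here maps a word to its reversal, and the CoT variant inserts a rationale request between the demonstrations and the answer slot, increasing the contextual distance the abstract refers to:

```python
# Hypothetical toy task: the hidden pattern reverses each input word.
demos = [("cat", "tac"), ("bird", "drib"), ("lamp", "pmal")]
query = "stone"

def direct_prompt(demos, query):
    """Direct answering: demonstrations followed immediately by the query,
    so the answer slot sits right after the pattern examples."""
    lines = [f"Input: {x}\nOutput: {y}" for x, y in demos]
    lines.append(f"Input: {query}\nOutput:")
    return "\n\n".join(lines)

def cot_prompt(demos, query):
    """CoT-style prompting: the model is asked to state the inferred pattern
    first, placing a rationale between the demonstrations and the answer."""
    lines = [f"Input: {x}\nOutput: {y}" for x, y in demos]
    lines.append(
        f"Input: {query}\n"
        "Let's think step by step: first infer the pattern shared by the "
        "demonstrations, then apply it to the new input.\nReasoning:"
    )
    return "\n\n".join(lines)

print(direct_prompt(demos, query))
print("---")
print(cot_prompt(demos, query))
```

Under the study's account, the second prompt style invites explicit pattern inference that LLMs often get wrong, while pushing the answer further from the demonstrations that drive implicit reasoning.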