In-context learning (ICL) underpins recent advances in large language models (LLMs), yet its role and performance in causal reasoning remain unclear. Causal reasoning demands multi-hop composition and strict conjunctive control, and reliance on spurious lexical relations in the input can yield misleading results. We hypothesize that, owing to their ability to project the input into a latent space, encoder and encoder-decoder architectures are better suited to such multi-hop conjunctive reasoning than decoder-only models. To test this, we compare fine-tuned versions of all three architectures against zero- and few-shot ICL in both natural-language and non-natural-language settings. We find that ICL alone is insufficient for reliable causal reasoning, as it often over-focuses on irrelevant input features. In particular, decoder-only models are noticeably brittle under distributional shift, whereas fine-tuned encoder and encoder-decoder models generalize more robustly across our tests, including the non-natural-language split. These architectures are matched or surpassed by decoder-only models only at large scales. We conclude that for cost-effective, robust causal reasoning over short horizons, encoder or encoder-decoder architectures with targeted fine-tuning are preferable.