Large language models (LLMs) are increasingly used in scientific domains. While they can produce reasoning-like content via methods such as chain-of-thought prompting, these outputs are typically unstructured and informal, obscuring whether models truly understand the fundamental reasoning paradigms that underpin scientific inference. To address this, we introduce a novel task named Latent Reasoning Chain Extraction (ARCHE), in which models must decompose complex reasoning arguments into combinations of standard reasoning paradigms in the form of a Reasoning Logic Tree (RLT). In an RLT, every reasoning step is explicitly categorized as one of three variants of Peirce's fundamental inference modes: deduction, induction, or abduction. To facilitate this task, we release ARCHE Bench, a new benchmark derived from 70 Nature Communications articles, including more than 1,900 references and 38,000 viewpoints. We propose two logic-aware evaluation metrics: Entity Coverage (EC) for content completeness and Reasoning Edge Accuracy (REA) for step-by-step logical validity. Evaluations of 10 leading LLMs on ARCHE Bench reveal that models exhibit a trade-off between REA and EC, and that none is yet able to extract a complete and standard reasoning chain. These findings highlight a substantial gap between the abilities of current reasoning models and the rigor required for scientific argumentation.
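To make the task and metrics more concrete, the sketch below shows one possible encoding of a Reasoning Logic Tree and naive set-overlap versions of the two metrics. All names (`InferenceMode`, `RLTNode`, `entity_coverage`, `reasoning_edge_accuracy`) and the exact formulations are illustrative assumptions, not the definitions used in ARCHE Bench.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from enum import Enum


class InferenceMode(Enum):
    """Peirce's three fundamental inference modes."""
    DEDUCTION = "deduction"
    INDUCTION = "induction"
    ABDUCTION = "abduction"


@dataclass
class RLTNode:
    """One viewpoint in a Reasoning Logic Tree (hypothetical structure).

    `mode` labels the inference step that derives this node from its
    premises; premise-only leaf nodes carry no mode.
    """
    claim: str
    mode: InferenceMode | None = None
    premises: list[RLTNode] = field(default_factory=list)


def entity_coverage(predicted: set[str], gold: set[str]) -> float:
    """Assumed EC: fraction of gold entities recovered by the extracted tree."""
    return len(predicted & gold) / len(gold) if gold else 1.0


def reasoning_edge_accuracy(
    predicted: set[tuple[str, str, InferenceMode]],
    gold: set[tuple[str, str, InferenceMode]],
) -> float:
    """Assumed REA: fraction of predicted (premise, conclusion, mode) edges
    that also appear, with the same inference mode, in the reference tree."""
    return len(predicted & gold) / len(predicted) if predicted else 0.0


# Toy usage: an observation supporting a hypothesis via an abductive step.
observation = RLTNode("Observed fact reported in the article")
hypothesis = RLTNode(
    "Hypothesis inferred as the best explanation of the observation",
    mode=InferenceMode.ABDUCTION,
    premises=[observation],
)
```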