Estimating the cognitive complexity of reading comprehension (RC) items is crucial for assessing item difficulty before the items are administered to learners. Unlike syntactic and semantic features, such as passage length or the semantic similarity between options, the cognitive features that arise during answer reasoning cannot be readily extracted with existing NLP tools and have traditionally relied on human annotation. In this study, we examine whether large language models (LLMs) can estimate the cognitive complexity of RC items by focusing on two dimensions, Evidence Scope and Transformation Level, which indicate the degree of cognitive burden involved in reasoning about the answer. Our experimental results demonstrate that LLMs can approximate the cognitive complexity of items, indicating their potential as tools for difficulty analysis prior to administration. Further analysis reveals a gap between LLMs' reasoning ability and their metacognitive awareness: even when they produce correct answers, they sometimes fail to correctly identify the features underlying their own reasoning process.
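For concreteness, the sketch below illustrates one way such LLM-based rating of an RC item on the two dimensions could be operationalized. It is a minimal illustration only: the prompt wording, the example level labels, and the call_llm helper are assumptions for exposition, not the paper's actual rubric or pipeline.

```python
import json

# Minimal sketch: ask an LLM to rate an RC item on Evidence Scope and
# Transformation Level. Level labels and prompt text are illustrative
# assumptions, not the paper's actual annotation scheme.

RATING_PROMPT = """You are analyzing a reading comprehension item.
Passage: {passage}
Question: {question}
Options: {options}
Correct answer: {answer}

Rate the item on two dimensions and reply in JSON:
- "evidence_scope": how much of the passage must be consulted to justify
  the answer (e.g., "single_sentence", "multiple_sentences", "whole_passage")
- "transformation_level": how much the evidence must be transformed to reach
  the answer (e.g., "verbatim_match", "paraphrase", "inference")
"""


def call_llm(prompt: str) -> str:
    """Hypothetical wrapper around any chat-completion API; returns the raw reply text."""
    raise NotImplementedError("plug in your LLM client here")


def estimate_cognitive_complexity(passage, question, options, answer):
    """Return a dict with the two estimated cognitive-complexity labels."""
    prompt = RATING_PROMPT.format(
        passage=passage,
        question=question,
        options="; ".join(options),
        answer=answer,
    )
    reply = call_llm(prompt)
    return json.loads(reply)  # e.g., {"evidence_scope": ..., "transformation_level": ...}
```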