The scaling of large language models (LLMs) emphasizes increasing depth, yet performance gains diminish as layers are added. Prior work introduced the concept of "effective depth", arguing that deeper models fail to fully utilize their layers for meaningful computation. Building on this, we systematically study how effective depth varies with model scale, training type, and task difficulty. First, we analyze the behavior of the Qwen-2.5 family (1.5B-32B) and find that although the number of effective layers grows with model size, the effective-depth ratio remains stable. Second, comparisons between base models and their long-CoT counterparts show no increase in effective depth, suggesting that improved reasoning stems from longer context rather than deeper per-token computation. Third, evaluations across tasks of varying difficulty indicate that models do not dynamically recruit more layers for harder problems. Our results suggest that current LLMs underuse their available depth across scales, training paradigms, and task difficulties, pointing to research opportunities in raising layer utilization, model pruning, and early exiting. Our code is released at https://github.com/AheadOFpotato/what_affects_effective_depth.
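To make the notion of per-token layer utilization concrete, the sketch below estimates a logit-lens-style "effective depth": the earliest layer whose intermediate hidden state, decoded through the model's final norm and LM head, already yields the model's final top-1 prediction. This is a minimal illustrative proxy under our own assumptions, not necessarily the paper's exact measurement; the model checkpoint, the agreement criterion, and the `effective_depth` helper are placeholders.

```python
# Illustrative sketch (not necessarily the paper's exact metric): estimate a
# logit-lens-style per-token "effective depth" as the earliest layer whose
# hidden state, decoded through the final norm and LM head, already produces
# the model's final top-1 prediction.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-1.5B"  # any Qwen-2.5 size; smaller sizes are cheaper to probe
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
model.eval()

@torch.no_grad()
def effective_depth(text: str) -> list[int]:
    inputs = tokenizer(text, return_tensors="pt")
    out = model(**inputs, output_hidden_states=True)
    final_pred = out.logits.argmax(-1)[0]    # (seq_len,) final top-1 token ids
    num_layers = len(out.hidden_states) - 1  # hidden_states[0] is the embedding output
    seq_len = final_pred.shape[0]
    # Default to full depth; hidden_states[-1] is already post-norm, so the top
    # layer reproduces the final prediction by construction.
    depth = torch.full((seq_len,), num_layers, dtype=torch.long)
    for layer in range(1, num_layers):
        # Decode the intermediate state through the final RMSNorm and LM head.
        logits = model.lm_head(model.model.norm(out.hidden_states[layer]))
        # Record the earliest layer that agrees with the final prediction
        # (a crude proxy: agreement is not required to persist upward).
        agree = (logits.argmax(-1)[0] == final_pred) & (depth == num_layers)
        depth[agree] = layer
    return depth.tolist()

depths = effective_depth("The capital of France is")
ratio = sum(depths) / (len(depths) * model.config.num_hidden_layers)
print(f"mean effective-depth ratio: {ratio:.2f}")
```

Averaging the per-token ratio over a corpus gives one way to compare layer utilization across model sizes, base vs. long-CoT checkpoints, or task difficulty buckets; other definitions (e.g. requiring agreement to persist through all subsequent layers) are equally plausible.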