We introduce Dynamic Nested Depth (DND), a novel method that improves the performance of off-the-shelf LLMs by selecting critical tokens and reprocessing them in a nested-depth manner. Specifically, at the end of a given transformer layer, DND identifies the more critical tokens with a router and feeds them back for an extra round of processing, effectively ``reviewing'' difficult tokens while avoiding redundant computation for easier ones. The dynamic selection mechanism is tailored for precise control via two novel strategies: a router-controlling loss that enhances the distinguishability of token selection, and a threshold-control scheme that ensures selection stability. We demonstrate the effectiveness of DND by directly integrating it into pre-trained dense and MoE models during a post-training phase. On diverse benchmarks, this approach boosts the performance of the dense Qwen3-1.7B by 1.88% and the MoE Qwen3-30B-A3B by 0.87%, all with a minimal increase in parameters and computation.
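To make the nested-depth idea concrete, the following is a minimal PyTorch-style sketch, not the authors' implementation: a hypothetical `DNDLayer` wrapper scores tokens with a linear router after a transformer block and re-applies the same block once more for tokens whose score exceeds a threshold `tau`. The router-controlling loss and threshold-control scheme from the paper are omitted, and for simplicity the sketch recomputes the block over the full sequence and merges only the selected tokens, whereas DND would restrict the extra computation to those tokens.

```python
import torch
import torch.nn as nn

class DNDLayer(nn.Module):
    """Illustrative sketch: wrap a transformer block and re-process
    router-selected ("critical") tokens through the same block once more."""

    def __init__(self, block: nn.Module, hidden_size: int, tau: float = 0.5):
        super().__init__()
        self.block = block                        # the original transformer layer
        self.router = nn.Linear(hidden_size, 1)   # scores each token's difficulty
        self.tau = tau                            # selection threshold (hypothetical default)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # First (standard) pass through the layer.
        h = self.block(x)                                   # (batch, seq, hidden)

        # Router scores in [0, 1]; tokens above the threshold are "critical".
        scores = torch.sigmoid(self.router(h)).squeeze(-1)  # (batch, seq)
        mask = scores > self.tau                             # (batch, seq), bool

        if mask.any():
            # Second, nested pass; here the whole sequence is recomputed for
            # simplicity, but only the selected tokens receive the update.
            # Gating by the router score keeps the selection differentiable.
            h_refined = self.block(h)
            gate = (scores * mask).unsqueeze(-1)             # zero for unselected tokens
            h = h + gate * (h_refined - h)
        return h
```

In this sketch the router and threshold are the only additions to the pre-trained layer, which is consistent with the abstract's claim of a minimal parameter overhead.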