Large language models have achieved remarkable capabilities, but their practical deployment is hindered by significant computational costs. Adaptive computation methods such as early-exiting promise to reduce these costs, yet they introduce a fundamental conflict: the per-token dynamism intended to save computation often creates system-level bottlenecks that can paradoxically reduce throughput in batched inference. This dissertation resolves this conflict by co-designing adaptive algorithms and model architectures to balance per-token dynamism against system-level efficiency. We first address critical sources of overhead in conventional early-exiting by proposing an efficient parallel decoding mechanism. We then show that deep parameter sharing provides an architectural foundation that not only yields compact, parameter-efficient models but also inherently mitigates the synchronization issues that hamper dynamic batched inference. Finally, we present a unified framework in which lightweight routers are pretrained to assign an optimal recursion depth to each token. This approach establishes a new Pareto frontier between efficiency and performance by jointly optimizing adaptive computation and parameter efficiency within a single model.
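To make the final contribution concrete, the following is a minimal conceptual sketch, not the dissertation's actual implementation, of a parameter-shared block reused recursively with a lightweight per-token router choosing each token's recursion depth. All class names, dimensions, and the hard argmax routing are illustrative assumptions.

```python
# Conceptual sketch only: a shared transformer block applied up to `max_depth`
# times, with a lightweight router assigning each token its own recursion depth.
import torch
import torch.nn as nn


class TokenDepthRouter(nn.Module):
    """Scores each token and maps the score to a recursion depth in [1, max_depth]."""

    def __init__(self, hidden_dim: int, max_depth: int):
        super().__init__()
        self.scorer = nn.Linear(hidden_dim, max_depth)  # "lightweight": a single linear layer
        self.max_depth = max_depth

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # hidden: [batch, seq, hidden_dim] -> depths: [batch, seq] in {1, ..., max_depth}
        # Hard argmax is for illustration; training would need a differentiable
        # or auxiliary routing objective, which this sketch omits.
        return self.scorer(hidden).argmax(dim=-1) + 1


class RecursiveEncoder(nn.Module):
    """One parameter-shared block; deeper computation reuses the same weights."""

    def __init__(self, hidden_dim: int = 256, n_heads: int = 4, max_depth: int = 4):
        super().__init__()
        self.shared_block = nn.TransformerEncoderLayer(
            d_model=hidden_dim, nhead=n_heads, batch_first=True
        )
        self.router = TokenDepthRouter(hidden_dim, max_depth)
        self.max_depth = max_depth

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        depths = self.router(hidden)                  # per-token recursion depth
        out = hidden
        for step in range(1, self.max_depth + 1):
            updated = self.shared_block(out)          # same weights at every depth
            active = (depths >= step).unsqueeze(-1)   # tokens that keep recursing
            out = torch.where(active, updated, out)   # exited tokens pass through unchanged
        return out


if __name__ == "__main__":
    model = RecursiveEncoder()
    x = torch.randn(2, 8, 256)   # [batch, seq, hidden]
    print(model(x).shape)        # torch.Size([2, 8, 256])
```

In this toy version, tokens that have exited still participate in attention and are simply carried forward unchanged; a batched-serving implementation would instead skip their computation, which is where the parameter-shared recursive structure helps avoid the synchronization overheads discussed above.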