The generation speed of LLMs is bottlenecked by autoregressive decoding, where tokens are predicted sequentially, one at a time. In contrast, diffusion large language models (dLLMs) theoretically allow for parallel token generation, but in practice they struggle to match the speed of autoregressive models without significantly sacrificing quality. We therefore introduce adaptive parallel decoding (APD), a novel method that dynamically adjusts the number of tokens sampled in parallel. We achieve this by defining a multiplicative mixture between the dLLM marginal probabilities and the joint probability of sequences under a small auxiliary autoregressive model. This inverts the standard setup of speculative decoding, where the goal is to sample from a large autoregressive verifier by drafting from a smaller model. We further optimize APD by enabling KV caching and limiting the size of the masked input. Altogether, our method exposes three tunable parameters to flexibly trade off throughput and quality. We show that APD provides markedly higher throughput with minimal quality degradation on downstream benchmarks.
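To illustrate the multiplicative mixture, here is one plausible form in our own notation (a sketch, not the exact formulation used by APD): write $p_{\theta}$ for the dLLM's per-token marginals over a block of $k$ masked positions, $q_{\phi}$ for the small auxiliary autoregressive model's joint distribution, $c$ for the already-decoded context, and $w \in [0,1]$ for a hypothetical mixture weight,

$$
p_{\mathrm{mix}}(x_{1:k} \mid c) \;\propto\; \Big(\prod_{i=1}^{k} p_{\theta}(x_i \mid c)\Big)^{1-w} \; q_{\phi}(x_{1:k} \mid c)^{\,w}.
$$

Roughly speaking, tokens are committed in parallel only up to the prefix on which the two distributions agree strongly enough under this mixture, so the number of tokens accepted per step adapts to model agreement rather than being fixed in advance.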