Multi-token generation has emerged as a promising paradigm for accelerating transformer-based large language model inference. Recent efforts primarily explore diffusion Large Language Models (dLLMs) for parallel decoding to reduce inference latency. To achieve autoregressive (AR)-level generation quality, many techniques adapt AR models into dLLMs to enable parallel decoding. However, they suffer from limited speedup over AR models due to a pretrain-to-posttrain mismatch. Specifically, the masked data distribution in post-training deviates significantly from the real-world data distribution seen during pretraining, and dLLMs rely on bidirectional attention, which conflicts with the causal prior learned during pretraining and hinders exact KV cache reuse. To address this, we introduce Jacobi Forcing, a progressive distillation paradigm in which models are trained on their own parallel decoding trajectories, smoothly converting AR models into efficient parallel decoders while preserving their pretrained causal inference property. Models trained under this paradigm, which we call Jacobi Forcing Models, achieve a 3.8x wall-clock speedup on coding and math benchmarks with minimal loss in performance. Building on the trajectory characteristics of Jacobi Forcing Models, we introduce multi-block decoding with rejection recycling, which yields up to 4.5x more accepted tokens per iteration and nearly 4.0x wall-clock speedup, effectively trading additional compute for lower inference latency. Our code is available at https://github.com/hao-ai-lab/JacobiForcing.
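To make the parallel decoding setting concrete, the sketch below illustrates the generic Jacobi fixed-point iteration that underlies this family of methods: a block of draft tokens is refined in parallel with a causal model until it stops changing, at which point it matches the AR greedy output. This is a minimal illustration, not the authors' implementation; the `model` callable (assumed to map token ids to per-position next-token logits), the zero-token initialization, and the `block_len` / `max_iters` parameters are illustrative assumptions.

```python
import torch

@torch.no_grad()
def jacobi_decode_block(model, prefix_ids, block_len, max_iters=32):
    """Decode `block_len` tokens after `prefix_ids` via Jacobi fixed-point iteration.

    `model` is assumed to return logits of shape [batch, seq_len, vocab]
    for a causal LM (e.g., `lambda ids: hf_model(ids).logits`).
    """
    device = prefix_ids.device
    # Initialize the draft block with an arbitrary guess (here: token id 0).
    draft = torch.zeros(1, block_len, dtype=torch.long, device=device)
    for _ in range(max_iters):
        # One parallel forward pass over prefix + current draft.
        logits = model(torch.cat([prefix_ids, draft], dim=1))
        # Greedy next-token predictions for every draft position.
        preds = logits[:, prefix_ids.size(1) - 1 : -1, :].argmax(dim=-1)
        if torch.equal(preds, draft):
            # Fixed point reached: the block equals the AR greedy continuation.
            break
        draft = preds  # refine the whole block in parallel
    return draft
```

In practice a vanilla AR model converges slowly under this iteration (few tokens stabilize per pass); Jacobi Forcing trains the model on its own such trajectories so that many draft tokens are accepted per forward pass while the causal attention pattern, and hence exact KV cache reuse, is preserved.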