Distribution Matching Distillation (DMD) distills score-based generative models into efficient one-step generators, without requiring a one-to-one correspondence with the sampling trajectories of their teachers. However, limited model capacity causes one-step distilled models to underperform on complex generative tasks, e.g., synthesizing intricate object motions in text-to-video generation. Directly extending DMD to multi-step distillation increases memory usage and computational depth, leading to instability and reduced efficiency. While prior works propose stochastic gradient truncation as a potential solution, we observe that it substantially reduces the generation diversity of multi-step distilled models, bringing it down to the level of their one-step counterparts. To address these limitations, we propose Phased DMD, a multi-step distillation framework that combines phase-wise distillation with a Mixture-of-Experts (MoE) design, reducing learning difficulty while enhancing model capacity. Phased DMD rests on two key ideas: progressive distribution matching and score matching within subintervals. First, our model divides the SNR range into subintervals and progressively refines the model toward higher SNR levels, so as to better capture complex distributions. Second, we rigorously derive the training objective within each subinterval to ensure its accuracy. We validate Phased DMD by distilling state-of-the-art image and video generation models, including Qwen-Image (20B parameters) and Wan2.2 (28B parameters). Experimental results demonstrate that Phased DMD preserves output diversity better than DMD while retaining key generative capabilities. We will release our code and models.
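To make the subinterval idea concrete, here is a minimal sketch (hypothetical illustration, not the authors' released code) of partitioning a normalized diffusion timestep range into contiguous phases, one per expert, with later phases covering higher-SNR (lower-noise) regions. The function name and the uniform split are assumptions for illustration only; the actual SNR partition in Phased DMD may differ.

```python
def split_into_phases(t_min: float, t_max: float, num_phases: int):
    """Split [t_min, t_max] into contiguous subintervals, one per phase/expert.

    Hypothetical helper: Phased DMD trains a distribution-matching objective
    restricted to each subinterval, progressively moving toward higher SNR.
    """
    step = (t_max - t_min) / num_phases
    return [(t_min + i * step, t_min + (i + 1) * step) for i in range(num_phases)]

# Example: four phases over a normalized timestep range [0, 1].
phases = split_into_phases(0.0, 1.0, 4)
# Each (lo, hi) pair would be assigned to one expert during distillation.
```

In practice, each phase would distill only the noise levels inside its subinterval, so no single student has to match the full distribution at once.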