Diffusion models have achieved remarkable success in image generation, yet their deployment remains constrained by heavy computational cost and the need for numerous inference steps. Previous efforts on few-step distillation attempt to skip redundant steps by training compact student models, yet they often suffer from heavy retraining costs and degraded generalization. In this work, we take a different perspective: we accelerate smartly rather than evenly, applying smaller speedups to the early semantic stages and larger ones to the later redundant phases. We instantiate this phase-aware strategy with two experts that specialize in the slow and fast denoising phases, respectively. Surprisingly, instead of investing massive effort in retraining student models, we find that simply equipping the base model with lightweight LoRA adapters achieves both efficient acceleration and strong generalization. We refer to these two adapters as Slow-LoRA and Fast-LoRA. Extensive experiments show that our method achieves up to 5× acceleration over the base model while maintaining comparable visual quality across diverse benchmarks. Remarkably, the LoRA experts are trained with only a single sample on one V100 GPU within an hour, yet the resulting models generalize strongly to unseen prompts.
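The phase-aware acceleration described above can be illustrated as a nonuniform timestep schedule: early semantic steps are kept at higher density (handled by the Slow-LoRA expert) while later redundant steps are thinned aggressively (handled by the Fast-LoRA expert). The following is a minimal sketch, not the paper's actual implementation; the function name, skip ratios, and 40/60 phase split are illustrative assumptions.

```python
# Hypothetical sketch of a phase-aware schedule: mild compression in the
# early "semantic" phase, aggressive compression in the later "redundant"
# phase. The skip ratios and split point are illustrative, not the
# paper's configuration.

def phase_aware_schedule(base_steps=50, split=0.4, slow_skip=2, fast_skip=8):
    """Return (step_index, expert) pairs for a nonuniform schedule.

    Steps in [0, split * base_steps) are kept at 1/slow_skip density and
    routed to the Slow-LoRA expert; the remaining steps are kept at
    1/fast_skip density and routed to the Fast-LoRA expert.
    """
    boundary = int(base_steps * split)
    schedule = [(t, "slow") for t in range(0, boundary, slow_skip)]
    schedule += [(t, "fast") for t in range(boundary, base_steps, fast_skip)]
    return schedule

sched = phase_aware_schedule()
# The effective speedup is base_steps divided by the retained step count.
speedup = 50 / len(sched)
```

At inference, each retained step would run the base model with the tagged LoRA adapter active; varying `slow_skip` and `fast_skip` trades quality in the semantic phase against overall speedup.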