The scalability of high-fidelity video diffusion models (VDMs) is constrained by two key sources of redundancy: the quadratic complexity of global spatio-temporal attention and the computational overhead of long iterative denoising trajectories. Existing acceleration methods -- such as sparse attention and step-distilled samplers -- typically target a single dimension in isolation and quickly hit diminishing returns as the remaining bottlenecks become dominant. In this work, we introduce USV (Unified Sparsification for Video diffusion models), an end-to-end trainable framework that overcomes this limitation by jointly orchestrating sparsification across both the model's internal computation and its sampling process. USV learns a dynamic, data- and timestep-dependent sparsification policy that prunes redundant attention connections, adaptively merges semantically similar tokens, and reduces the number of denoising steps, treating these not as independent techniques but as coordinated actions under a single optimization objective. This multi-dimensional co-design enables strong mutual reinforcement among previously disjoint acceleration strategies. Extensive experiments on large-scale video generation benchmarks demonstrate that USV achieves up to an 83.3% speedup of the denoising process and a 22.7% end-to-end acceleration while maintaining high visual fidelity. Our results highlight unified, dynamic sparsification as a practical path toward efficient, high-quality video generation.
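The abstract gives no implementation details, so the following is only a minimal, hypothetical sketch of what a joint, timestep-conditioned sparsification policy of this kind might look like. The class name `UnifiedSparsityPolicy`, the keep/merge ratios, the saliency-based selection rule, and all tensor shapes are illustrative assumptions, not the authors' method; the sketch is meant solely to show how attention pruning, token merging, and step skipping could be emitted by one shared policy network.

```python
# Illustrative sketch (not the paper's code): one policy network that, given
# spatio-temporal tokens and the diffusion timestep, jointly emits
# (i) a sparse-attention keep-mask, (ii) indices of tokens to merge, and
# (iii) a step-skip probability. All names and hyperparameters are assumptions.
import torch
import torch.nn as nn


class UnifiedSparsityPolicy(nn.Module):
    def __init__(self, dim: int, keep_ratio: float = 0.5, merge_ratio: float = 0.25):
        super().__init__()
        self.keep_ratio = keep_ratio        # fraction of tokens kept for attention
        self.merge_ratio = merge_ratio      # fraction of tokens merged away
        self.t_embed = nn.Sequential(nn.Linear(1, dim), nn.SiLU(), nn.Linear(dim, dim))
        self.score = nn.Linear(dim, 1)      # per-token saliency score
        self.skip_head = nn.Linear(dim, 1)  # per-sample step-skip logit

    def forward(self, x: torch.Tensor, t: torch.Tensor):
        # x: (B, N, D) tokens; t: (B,) normalized timestep in [0, 1]
        h = x + self.t_embed(t[:, None, None].expand(-1, x.size(1), 1))
        saliency = self.score(h).squeeze(-1)                  # (B, N)

        # (i) Attention sparsification: keep only the most salient tokens.
        k = max(1, int(self.keep_ratio * x.size(1)))
        keep_idx = saliency.topk(k, dim=-1).indices           # (B, k)
        attn_mask = torch.zeros_like(saliency, dtype=torch.bool)
        attn_mask.scatter_(1, keep_idx, torch.ones_like(keep_idx, dtype=torch.bool))

        # (ii) Token merging: mark the least salient tokens as merge candidates
        # (to be folded into similar kept tokens downstream).
        m = int(self.merge_ratio * x.size(1))
        merge_idx = (-saliency).topk(m, dim=-1).indices       # (B, m)

        # (iii) Step reduction: probability of skipping the next denoising step.
        skip_prob = torch.sigmoid(self.skip_head(h.mean(dim=1))).squeeze(-1)  # (B,)
        return attn_mask, merge_idx, skip_prob


if __name__ == "__main__":
    policy = UnifiedSparsityPolicy(dim=64)
    tokens = torch.randn(2, 128, 64)         # 2 videos, 128 tokens each
    timestep = torch.tensor([0.9, 0.1])      # early vs. late denoising
    mask, merge, skip = policy(tokens, timestep)
    print(mask.shape, merge.shape, skip.shape)  # (2, 128) (2, 32) (2,)
```

Conditioning all three decisions on the same timestep-aware features is what would let such a policy be trained under a single objective, rather than tuning attention sparsity, token merging, and step count separately.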