The dominance of denoising generative models (e.g., diffusion, flow matching) in visual synthesis is tempered by their substantial training costs and inefficiencies in representation learning. While injecting discriminative representations via auxiliary alignment has proven effective, this approach faces a key limitation: the reliance on external, pre-trained encoders introduces overhead and domain shift. A dispersion-based strategy that encourages strong separation among in-batch latent representations alleviates this dependency. To study how the number of negative samples affects generative modeling, we propose {\mname}, a plug-and-play training framework that requires no external encoders. Our method integrates a memory-bank mechanism that maintains a large, dynamically updated queue of negative samples across training iterations. This decouples the number of negatives from the mini-batch size, providing abundant and high-quality negatives for a contrastive objective without a multiplicative increase in computational cost. A low-dimensional projection head further minimizes memory and bandwidth overhead. {\mname} offers three principal advantages: (1) it is self-contained, eliminating dependency on pretrained vision foundation models and their associated forward-pass overhead; (2) it introduces no additional parameters or computational cost during inference; and (3) it enables substantially faster convergence, achieving superior generative quality more efficiently. On ImageNet-256, {\mname} achieves a state-of-the-art FID of \textbf{2.40} within 400k steps, significantly outperforming comparable methods.
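The memory-bank idea described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the class and function names (`NegativeQueue`, `dispersion_loss`) and the specific loss form (a log-mean-exp over scaled similarities, MoCo-style FIFO queue updates) are assumptions chosen to make the decoupling of queue size from batch size concrete.

```python
import numpy as np

class NegativeQueue:
    """Hypothetical fixed-size FIFO bank of negative embeddings.
    The queue size is independent of the mini-batch size."""

    def __init__(self, queue_size: int, dim: int, seed: int = 0):
        rng = np.random.default_rng(seed)
        # Initialize with random unit vectors so the queue is usable
        # from the very first training iteration.
        q = rng.standard_normal((queue_size, dim))
        self.queue = q / np.linalg.norm(q, axis=1, keepdims=True)
        self.ptr = 0  # next write position (wraps around)

    def enqueue(self, batch_emb: np.ndarray) -> None:
        """Overwrite the oldest entries with the newest batch of
        (already low-dimensional, unit-norm) projected latents."""
        n = batch_emb.shape[0]
        idx = (self.ptr + np.arange(n)) % self.queue.shape[0]
        self.queue[idx] = batch_emb
        self.ptr = (self.ptr + n) % self.queue.shape[0]

def dispersion_loss(z: np.ndarray, negatives: np.ndarray,
                    tau: float = 0.1) -> float:
    """Repulsion term: penalize similarity between in-batch latents z
    and the queued negatives (log-mean-exp of similarities / tau;
    lower values mean the latents are more dispersed)."""
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    sim = z @ negatives.T / tau                      # (batch, queue)
    m = sim.max(axis=1, keepdims=True)               # numerical stability
    lse = m.squeeze(1) + np.log(np.exp(sim - m).mean(axis=1))
    return float(lse.mean())
```

In this sketch the loss adds zero inference-time cost: both the queue and the projection exist only during training, matching advantage (2) above.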