U-REPA: Aligning Diffusion U-Nets to ViTs

Representation Alignment (REPA), which aligns Diffusion Transformer (DiT) hidden states with ViT visual encoders, has proven highly effective in DiT training and markedly improves convergence. However, it has not been validated on the canonical diffusion U-Net architecture, which converges faster than DiTs. Adapting REPA to U-Nets presents unique challenges: (1) the blocks of a U-Net serve different functions, which calls for a revised alignment strategy; (2) the U-Net's spatial downsampling operations create spatial-dimension mismatches with the ViT features; (3) gaps between the U-Net and ViT feature spaces hinder the effectiveness of tokenwise alignment. To address these challenges, we propose \textbf{U-REPA}, a representation alignment paradigm that bridges U-Net hidden states and ViT features. First, we observe that, because of skip connections, the middle stage of the U-Net is the best place to align. Second, we upsample the U-Net features after passing them through MLPs. Third, we observe that tokenwise similarity alignment is difficult, and we introduce a manifold loss that regularizes the relative similarity between samples. Experiments indicate that U-REPA achieves excellent generation quality and greatly accelerates convergence. With CFG applied over a guidance interval, U-REPA reaches $FID<1.5$ in 200 epochs or 1M iterations on ImageNet 256 $\times$ 256, and needs only half the total epochs to outperform REPA under sd-vae-ft-ema. Code: https://github.com/YuchuanTian/U-REPA
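The three components above can be illustrated with a minimal sketch. It is not the released implementation, and it relies on several assumptions the abstract does not state: a single linear layer stands in for the MLP projector, upsampling is nearest-neighbour, the tokenwise term is a negative mean cosine similarity, the manifold term is the mean squared gap between pairwise cosine-similarity matrices with each token treated as a "sample", and the weight `lam` is a placeholder. All function names are hypothetical. Plain Python lists keep the sketch self-contained; a real implementation would use a tensor library.

```python
import math

def linear(x, weight):
    """Project one feature vector; stands in for the MLP projector (assumption)."""
    return [sum(w * a for w, a in zip(row, x)) for row in weight]

def upsample_nearest(tokens, h, w, factor):
    """Nearest-neighbour upsampling of an h x w token grid stored row-major."""
    return [tokens[(i // factor) * w + (j // factor)]
            for i in range(h * factor) for j in range(w * factor)]

def cosine(u, v, eps=1e-8):
    """Cosine similarity with a small epsilon to avoid division by zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + eps)

def tokenwise_loss(pred_tokens, target_tokens):
    """Negative mean cosine similarity between matching tokens."""
    sims = [cosine(p, t) for p, t in zip(pred_tokens, target_tokens)]
    return -sum(sims) / len(sims)

def similarity_matrix(feats):
    """Pairwise cosine similarities among a list of feature vectors."""
    return [[cosine(a, b) for b in feats] for a in feats]

def manifold_loss(pred_feats, target_feats):
    """Mean squared gap between the two pairwise-similarity matrices (assumed form)."""
    sp = similarity_matrix(pred_feats)
    st = similarity_matrix(target_feats)
    n = len(sp)
    return sum((sp[i][j] - st[i][j]) ** 2
               for i in range(n) for j in range(n)) / (n * n)

def urepa_style_loss(unet_tokens, vit_tokens, weight, h, w, factor, lam=0.5):
    """Project first, then upsample, then combine both alignment terms."""
    projected = [linear(x, weight) for x in unet_tokens]    # projector first
    upsampled = upsample_nearest(projected, h, w, factor)   # then match the ViT grid
    return (tokenwise_loss(upsampled, vit_tokens)
            + lam * manifold_loss(upsampled, vit_tokens))
```

Because the manifold term compares similarity structure instead of individual vectors, it is unchanged when features are merely rescaled, so it constrains relative geometry without forcing exact tokenwise agreement.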
