Latent diffusion models (LDMs) achieve state-of-the-art image synthesis, yet their reconstruction-style denoising objective provides only indirect semantic supervision: high-level semantics emerge slowly, requiring longer training and limiting sample quality. Recent works inject semantics from Vision Foundation Models (VFMs) either externally via representation alignment or internally by jointly modeling only a narrow slice of VFM features inside the diffusion process, under-utilizing the rich, nonlinear, multi-layer spatial semantics available. We introduce REGLUE (Representation Entanglement with Global-Local Unified Encoding), a unified latent diffusion framework that jointly models (i) VAE image latents, (ii) compact local (patch-level) VFM semantics, and (iii) a global (image-level) [CLS] token within a single SiT backbone. A lightweight convolutional semantic compressor nonlinearly aggregates multi-layer VFM features into a low-dimensional, spatially structured representation, which is entangled with the VAE latents in the diffusion process. An external alignment loss further regularizes internal representations toward frozen VFM targets. On ImageNet 256×256, REGLUE consistently improves FID and accelerates convergence over SiT-B/2 and SiT-XL/2 baselines, as well as over REPA, ReDi, and REG. Extensive experiments show that (a) spatial VFM semantics are crucial, (b) non-linear compression is key to unlocking their full benefit, and (c) global tokens and external alignment act as complementary, lightweight enhancements within our global-local-latent joint modeling framework. The code is available at https://github.com/giorgospets/reglue.
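The semantic compressor described above can be sketched in PyTorch. This is a minimal illustrative sketch, not the paper's implementation: the class name, layer widths, channel counts, and grid size are all assumptions; the only properties taken from the abstract are that the module is a lightweight convolutional network that nonlinearly aggregates patch features from several VFM layers into a low-dimensional, spatially structured representation.

```python
# Hypothetical sketch of a convolutional semantic compressor (all names and
# hyperparameters are illustrative assumptions, not taken from the paper).
import torch
import torch.nn as nn


class SemanticCompressor(nn.Module):
    """Nonlinearly aggregate multi-layer VFM patch features into a
    low-dimensional, spatially structured semantic representation."""

    def __init__(self, num_layers=4, vfm_dim=768, hidden_dim=256,
                 out_dim=8, grid=16):
        super().__init__()
        self.grid = grid
        self.net = nn.Sequential(
            # 1x1 conv fuses the concatenated per-layer channels nonlinearly.
            nn.Conv2d(num_layers * vfm_dim, hidden_dim, kernel_size=1),
            nn.SiLU(),
            # 3x3 conv mixes neighboring patches, keeping spatial structure.
            nn.Conv2d(hidden_dim, hidden_dim, kernel_size=3, padding=1),
            nn.SiLU(),
            # Project down to a compact semantic latent.
            nn.Conv2d(hidden_dim, out_dim, kernel_size=1),
        )

    def forward(self, layer_feats):
        # layer_feats: list of (B, N, C) patch-token tensors from several
        # VFM layers, where N = grid * grid patches.
        B, N, _ = layer_feats[0].shape
        x = torch.cat(layer_feats, dim=-1)            # (B, N, num_layers * C)
        x = x.transpose(1, 2).reshape(B, -1, self.grid, self.grid)
        return self.net(x)                            # (B, out_dim, grid, grid)


# Example: four layers of 16x16 = 256 patch tokens with 768 channels each.
feats = [torch.randn(2, 256, 768) for _ in range(4)]
z_sem = SemanticCompressor()(feats)
print(z_sem.shape)  # torch.Size([2, 8, 16, 16])
```

The compressed output keeps the patch grid's spatial layout, so it can be concatenated or otherwise entangled with the VAE latents channel-wise inside the diffusion process; the exact entanglement mechanism is not specified here.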