Variational Autoencoders (VAEs) remain a cornerstone of generative computer vision, yet their training is often plagued by artifacts that degrade reconstruction and generation quality. This paper introduces VIVAT, a systematic approach to mitigating common artifacts in KL-VAE training without requiring radical architectural changes. We present a detailed taxonomy of five prevalent artifacts (color shift, grid patterns, blur, and corner and droplet artifacts) and analyze their root causes. Through straightforward modifications, including adjustments to loss weights and padding strategies and the integration of Spatially Conditional Normalization, we demonstrate significant improvements in VAE performance. Our method achieves state-of-the-art results in image reconstruction metrics (PSNR and SSIM) across multiple benchmarks and enhances text-to-image generation quality, as evidenced by superior CLIP scores. By preserving the simplicity of the KL-VAE framework while addressing its practical challenges, VIVAT offers actionable insights for researchers and practitioners aiming to optimize VAE training.
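To make the loss-weight adjustment concrete, the sketch below shows a generic weighted KL-VAE objective. It is an illustration only, not the paper's exact formulation: the choice of an L1 reconstruction term and the weight values `w_recon` and `w_kl` are assumptions, and VIVAT additionally tunes weights for terms (e.g., perceptual and adversarial losses) not shown here.

```python
import numpy as np

def kl_vae_loss(recon, target, mu, logvar, w_recon=1.0, w_kl=1e-6):
    """Weighted KL-VAE objective (illustrative sketch; weights are placeholders)."""
    # Pixel reconstruction term; L1 is one common choice, the paper's may differ.
    recon_loss = np.mean(np.abs(recon - target))
    # KL divergence of the posterior q(z|x) = N(mu, exp(logvar)) from N(0, I).
    kl = -0.5 * np.mean(1.0 + logvar - mu**2 - np.exp(logvar))
    # Rebalancing these weights is one of the simple fixes VIVAT advocates.
    return w_recon * recon_loss + w_kl * kl
```

In practice the relative magnitude of `w_kl` controls how strongly the latent distribution is regularized toward the prior; setting it too high is one known source of blur in reconstructions.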