FlashI2V: Fourier-Guided Latent Shifting Prevents Conditional Image Leakage in Image-to-Video Generation

In Image-to-Video (I2V) generation, a video is generated with an input image as the first-frame condition. Existing I2V methods concatenate the full information of the conditional image with the noisy latents to achieve high fidelity. However, the denoisers in these methods tend to take a shortcut through the conditional image, a phenomenon known as conditional image leakage, which degrades performance through issues such as slow motion and color inconsistency. In this work, we further clarify that conditional image leakage causes overfitting to in-domain data and degrades performance in out-of-domain scenarios. Moreover, we introduce Fourier-Guided Latent Shifting I2V, named FlashI2V, to prevent conditional image leakage. Concretely, FlashI2V consists of two components: (1) Latent Shifting. We modify the source and target distributions of flow matching by subtracting the conditional image information from the noisy latents, thereby incorporating the condition implicitly. (2) Fourier Guidance. We use high-frequency magnitude features obtained by the Fourier Transform to accelerate convergence and enable the adjustment of detail levels in the generated video. Experimental results show that our method effectively overcomes conditional image leakage and achieves the best generalization and out-of-domain performance among various I2V paradigms. With only 1.3B parameters, FlashI2V achieves a dynamic degree score of 53.01 on Vbench-I2V, surpassing CogVideoX1.5-5B-I2V and Wan2.1-I2V-14B-480P. Project page: https://pku-yuangroup.github.io/FlashI2V/
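The two components can be illustrated with a minimal NumPy sketch. It is not the authors' implementation; it is one possible reading of the abstract under stated assumptions: a linear-interpolation flow-matching path between data and Gaussian noise, a conditional image latent that is broadcast over all frames and subtracted from both endpoints, and a simple radial high-pass mask for the Fourier features. The function names `shifted_flow_pair` and `high_freq_magnitude`, the tensor layouts, and the `cutoff` parameter are all hypothetical.

```python
import numpy as np


def shifted_flow_pair(video_latent, image_latent, noise, t):
    """Build a flow-matching training pair with the condition subtracted.

    Assumed layouts (hypothetical): video_latent and noise are (C, T, H, W),
    image_latent is (C, H, W), and t is a scalar in [0, 1].
    """
    cond = image_latent[:, None, :, :]  # broadcast the condition over frames
    target = video_latent - cond        # shifted data endpoint (t = 0)
    source = noise - cond               # shifted noise endpoint (t = 1)
    x_t = (1.0 - t) * target + t * source
    # The shift cancels in the velocity, which equals noise - video_latent.
    velocity = source - target
    return x_t, velocity


def high_freq_magnitude(image_latent, cutoff=0.25):
    """Magnitude of the high-frequency part of a (C, H, W) latent's spectrum.

    `cutoff` is an assumed fraction of the Nyquist frequency; bins closer to
    the spectrum centre than this radius are zeroed out.
    """
    _, h, w = image_latent.shape
    spectrum = np.fft.fft2(image_latent, axes=(-2, -1))
    spectrum = np.fft.fftshift(spectrum, axes=(-2, -1))
    # Per-bin frequencies in cycles per sample, centred to match fftshift.
    fy = np.fft.fftshift(np.fft.fftfreq(h))[:, None]
    fx = np.fft.fftshift(np.fft.fftfreq(w))[None, :]
    radius = np.sqrt(fx ** 2 + fy ** 2)
    high_pass = radius > cutoff * 0.5   # 0.5 cycles per sample is Nyquist
    return np.abs(spectrum) * high_pass


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    video = rng.normal(size=(4, 5, 8, 8))   # toy latent video
    image = video[:, 0]                     # first frame as the condition
    noise = rng.normal(size=video.shape)
    x_t, v = shifted_flow_pair(video, image, noise, t=0.3)
    feats = high_freq_magnitude(image)
    print(x_t.shape, v.shape, feats.shape)
```

In this sketch the denoiser input `x_t` never contains the clean conditional image as a separate channel, which is the sense in which the condition is incorporated implicitly. Raising or lowering `cutoff` changes how much fine detail survives in the Fourier features; how these features are fed to the model is not specified in the abstract and is left out here.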
