Recent advances in video generation have achieved remarkable improvements in visual content fidelity. However, the absence of synchronized audio severely undermines the immersive experience and restricts the practical applications of these technologies. To address this challenge, several pioneering works, including Kling-foley, HunyuanVideo-foley, and Thinksound, have explored diffusion transformer architectures for generating plausible video-synchronized audio. In contrast to existing works, we introduce DreamFoley, an autoregressive audio generation architecture that harnesses the capabilities of large vision-language models (VLMs) to jointly model sequential interactions among the video, audio, and text modalities. Our approach features a dual-visual-encoder module that effectively captures both audio-aligned and text-aligned visual features. In addition, we employ a Residual Vector Quantization (RVQ) audio tokenizer with a delay-pattern generation scheme to balance training efficiency against audio quality. We further introduce a classifier-free guidance strategy into the VLM to improve the quality of the generated audio, and we establish an efficient data production pipeline to scale the collection of audio-video-text triplets. Finally, extensive experiments validate the effectiveness of our model, which achieves promising performance across popular benchmarks. We hope the findings in this study provide a strong foundation for future video-to-audio generation research. We also release audio-visual textual descriptions that were previously missing from the public benchmark, so that subsequent researchers can conduct more convenient and effective evaluations and comparisons.
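As a rough illustration of two mechanisms named above, the sketch below shows a MusicGen-style delay-pattern layout for RVQ codes and a logit-space classifier-free guidance blend. The function names, the pad_id sentinel, and the exact array layout are our own assumptions for exposition and are not taken from the DreamFoley implementation.

```python
import numpy as np

def apply_delay_pattern(codes: np.ndarray, pad_id: int) -> np.ndarray:
    """Shift RVQ codebook level k right by k frames (delay-pattern layout).

    codes: (K, T) array of token ids, one row per RVQ codebook level.
    Returns a (K, T + K - 1) array where positions vacated by the shift
    are filled with pad_id, so all K levels can be generated in a single
    autoregressive stream instead of K nested passes.
    """
    K, T = codes.shape
    out = np.full((K, T + K - 1), pad_id, dtype=codes.dtype)
    for k in range(K):
        out[k, k:k + T] = codes[k]
    return out

def revert_delay_pattern(delayed: np.ndarray) -> np.ndarray:
    """Undo apply_delay_pattern, recovering the original (K, T) code matrix."""
    K, total = delayed.shape
    T = total - (K - 1)
    return np.stack([delayed[k, k:k + T] for k in range(K)])

def cfg_logits(cond: np.ndarray, uncond: np.ndarray, scale: float) -> np.ndarray:
    """Classifier-free guidance on next-token logits: push the conditional
    prediction further away from the unconditional one before sampling."""
    return uncond + scale * (cond - uncond)
```

Under these assumptions, the delay pattern lets all K codebook levels be predicted within one autoregressive stream at the cost of only K - 1 extra steps, which is one way to trade generation cost against audio quality; the guidance blend would be applied to the conditional and unconditional next-token logits at each decoding step before sampling.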