Video-to-audio generation (V2A) is of increasing importance in domains such as film post-production, AR/VR, and sound design, particularly for creating Foley sound effects synchronized with on-screen actions. Foley requires generating audio that is both semantically aligned with visible events and temporally aligned with their timing. Yet there is a mismatch between evaluation and downstream applications, owing to the absence of a benchmark tailored to Foley-style scenarios. We find that 74% of videos in past evaluation datasets have poor audio-visual correspondence; moreover, they are dominated by speech and music, domains that lie outside the Foley use case. To address this gap, we introduce FoleyBench, the first large-scale benchmark explicitly designed for Foley-style V2A evaluation. FoleyBench contains 5,000 (video, ground-truth audio, text caption) triplets, each featuring visible sound sources whose audio is causally tied to on-screen events. The dataset is built with an automated, scalable pipeline applied to in-the-wild internet videos from YouTube- and Vimeo-based sources. Compared to past datasets, FoleyBench offers stronger coverage of the sound categories in a taxonomy designed specifically for Foley sound. Each clip is further labeled with metadata capturing source complexity, UCS/AudioSet category, and video length, enabling fine-grained analysis of model performance and failure modes. We benchmark several state-of-the-art V2A models, evaluating audio quality, audio-video alignment, temporal synchronization, and audio-text consistency. Samples are available at: https://gclef-cmu.org/foleybench
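For concreteness, the sketch below shows what a single FoleyBench record could look like, based only on the fields named in the abstract (triplet plus metadata). The class and field names are hypothetical illustrations, not the paper's actual schema.

```python
from dataclasses import dataclass

# Hypothetical record layout for one benchmark clip; the real
# FoleyBench schema may organize these fields differently.
@dataclass
class FoleyBenchClip:
    video_path: str          # clip with a visible sound source
    audio_path: str          # ground-truth audio, causally tied to on-screen events
    caption: str             # text description of the sound
    source_complexity: str   # assumed label, e.g. single vs. multiple sound sources
    ucs_category: str        # Universal Category System (UCS) class
    audioset_category: str   # AudioSet ontology class
    duration_s: float        # video length in seconds
```

Metadata fields like these would support the fine-grained analysis the abstract describes, e.g. slicing model scores by category or by source complexity.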