Joint editing of audio and visual content is crucial for precise and controllable content creation. This new task is challenging due to the scarcity of paired audio-visual data before and after targeted edits and the heterogeneity across modalities. To address these data and modeling challenges, we introduce SAVEBench, a paired audio-visual dataset with text and mask conditions that enables object-grounded source-to-target learning. With SAVEBench, we train the Schrödinger Audio-Visual Editor (SAVE), an end-to-end flow-matching model that edits audio and video in parallel while keeping them aligned throughout processing. SAVE incorporates a Schrödinger Bridge that learns a direct transport from the source to the target audio-visual mixture. Our evaluation demonstrates that SAVE removes target objects from both the audio and the visual content while preserving the remaining content, and achieves stronger temporal synchronization and audio-visual semantic correspondence than pairwise combinations of an audio editor and a video editor.
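To make the "direct transport" idea concrete, the following is a minimal, hypothetical PyTorch sketch of bridge-style flow matching between paired pre-edit and post-edit latents. It assumes paired endpoints (x0 = source latent, x1 = edited target latent) and uses a Brownian-bridge parameterization common in bridge-matching formulations; the names (`velocity_net`, `sigma`, `cond`) and the specific loss are illustrative assumptions, not the paper's actual training code.

```python
import torch
import torch.nn as nn

# Hypothetical sketch of bridge-matching training between paired
# source/target latents. `velocity_net`, `sigma`, and `cond` are
# illustrative placeholders, not the SAVE paper's interface.

sigma = 0.1  # bridge noise scale (assumed hyperparameter)

def bridge_matching_loss(velocity_net: nn.Module,
                         x0: torch.Tensor,    # source audio-visual latent
                         x1: torch.Tensor,    # target (edited) latent
                         cond: torch.Tensor   # text/mask conditioning
                         ) -> torch.Tensor:
    """One training step of Brownian-bridge flow matching.

    Samples a point on the stochastic bridge pinned at the paired
    endpoints and regresses the network onto the bridge drift.
    """
    b = x0.shape[0]
    # Per-sample time in (0, 1), broadcastable over latent dims.
    t = torch.rand(b, device=x0.device).view(b, *([1] * (x0.dim() - 1)))
    eps = torch.randn_like(x0)

    # Brownian bridge pinned at x0 (t=0) and x1 (t=1):
    # x_t = (1 - t) x0 + t x1 + sigma * sqrt(t (1 - t)) * eps
    x_t = (1 - t) * x0 + t * x1 + sigma * (t * (1 - t)).sqrt() * eps

    # Regression target: conditional drift of the bridge at (x_t, t).
    # In the sigma -> 0 limit this reduces to the straight-line
    # velocity (x1 - x0) of rectified flow matching.
    target = (x1 - x_t) / (1 - t).clamp(min=1e-4)

    pred = velocity_net(x_t, t.flatten(), cond)
    return (pred - target).pow(2).mean()
```

At inference, one would integrate the learned drift from the source latent toward the edited latent (e.g., with a few Euler steps), so editing starts from the source content itself rather than from pure noise; this is the practical appeal of a source-to-target transport over a noise-to-data one.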