Stereophonic audio is an indispensable ingredient for enhancing the human auditory experience. Recent research has explored using visual information as guidance to generate binaural or ambisonic audio from mono recordings, with stereo supervision. However, this fully supervised paradigm suffers from an inherent drawback: recording stereophonic audio usually requires specialized devices that are too expensive to be widely accessible. To overcome this challenge, we propose to leverage the vast amount of available mono audio data to facilitate the generation of stereophonic audio. Our key observation is that the task of visually indicated audio separation also maps independent audio sources to their corresponding visual positions, an objective similar to that of stereophonic audio generation. We integrate both stereo generation and source separation into a unified framework, Sep-Stereo, by treating source separation as a particular type of audio spatialization. Specifically, a novel associative pyramid network architecture is carefully designed for audio-visual feature fusion. Extensive experiments demonstrate that our framework improves stereophonic audio generation results while performing accurate sound separation with a shared backbone.
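To make the fusion idea concrete, below is a minimal PyTorch sketch of one plausible reading of pyramid-style audio-visual fusion, in which a pooled visual embedding modulates audio spectrogram features at multiple decoder scales. The layer sizes, the sigmoid channel-modulation rule, and the names `AssociativeFusion` and `PyramidFusion` are illustrative assumptions, not the paper's exact associative pyramid design.

```python
# Minimal sketch (not the authors' released code): pooled visual features
# modulate audio spectrogram features at a pyramid of scales. All shapes
# and the fusion rule itself are illustrative assumptions.
import torch
import torch.nn as nn


class AssociativeFusion(nn.Module):
    """Fuse a visual embedding into an audio feature map at one scale."""

    def __init__(self, visual_dim: int, audio_channels: int):
        super().__init__()
        # Project the visual embedding to one weight per audio channel.
        self.proj = nn.Linear(visual_dim, audio_channels)

    def forward(self, audio_feat: torch.Tensor, visual_emb: torch.Tensor):
        # audio_feat: (B, C, T, F) spectrogram features; visual_emb: (B, D).
        weights = torch.sigmoid(self.proj(visual_emb))  # (B, C)
        weights = weights[:, :, None, None]             # broadcast over T, F
        # Channel-wise modulation; the residual keeps the audio path intact.
        return audio_feat + audio_feat * weights


class PyramidFusion(nn.Module):
    """Apply associative fusion at each level of an audio feature pyramid."""

    def __init__(self, visual_dim: int, audio_channels: list):
        super().__init__()
        self.stages = nn.ModuleList(
            AssociativeFusion(visual_dim, c) for c in audio_channels
        )

    def forward(self, audio_feats, visual_emb):
        # audio_feats: list of (B, C_i, T_i, F_i) maps from an audio U-Net.
        return [stage(f, visual_emb) for stage, f in zip(self.stages, audio_feats)]


if __name__ == "__main__":
    fuser = PyramidFusion(visual_dim=512, audio_channels=[64, 128])
    feats = [torch.randn(2, 64, 64, 32), torch.randn(2, 128, 32, 16)]
    vis = torch.randn(2, 512)  # e.g., pooled CNN features of a video frame
    fused = fuser(feats, vis)
    print([f.shape for f in fused])  # shapes are preserved at every scale
```

Because the fusion only modulates channels and preserves feature-map shapes, the same audio backbone can serve both the stereo-generation and separation heads, which is the property the shared-backbone claim relies on.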