Sep-Setero:通过联合源分离,可视向导声波音频生成 (Sep-Stereo: Visually Guided Stereophonic Audio Generation by Associating Source Separation) - 专知论文

会员服务 ·

0

Sep-Stereo · 分离的 · Mono · Extensibility · Guidance ·

2020 年 7 月 20 日

Sep-Stereo: Visually Guided Stereophonic Audio Generation by Associating Source Separation

翻译：Sep-Setero:通过联合源分离,可视向导声波音频生成

Hang Zhou,Xudong Xu,Dahua Lin,Xiaogang Wang,Ziwei Liu

from arxiv, To appear in Proceedings of the European Conference on Computer Vision (ECCV), 2020. Code, models, and video results are available on our webpage: https://hangz-nju-cuhk.github.io/projects/Sep-Stereo

Stereophonic audio is an indispensable ingredient to enhance human auditory experience. Recent research has explored the usage of visual information as guidance to generate binaural or ambisonic audio from mono ones with stereo supervision. However, this fully supervised paradigm suffers from an inherent drawback: the recording of stereophonic audio usually requires delicate devices that are expensive for wide accessibility. To overcome this challenge, we propose to leverage the vastly available mono data to facilitate the generation of stereophonic audio. Our key observation is that the task of visually indicated audio separation also maps independent audios to their corresponding visual positions, which shares a similar objective with stereophonic audio generation. We integrate both stereo generation and source separation into a unified framework, Sep-Stereo, by considering source separation as a particular type of audio spatialization. Specifically, a novel associative pyramid network architecture is carefully designed for audio-visual feature fusion. Extensive experiments demonstrate that our framework can improve the stereophonic audio generation results while performing accurate sound separation with a shared backbone.

翻译：神经声学是提高人类听觉经验不可或缺的要素。最近的研究探索了视觉信息的使用,以作为向导,从单声波监听器生成双声或双声音。然而,这一受到充分监督的范式存在固有的缺陷:音响录音记录通常需要精密的装置,而这种装置对于广泛获取来说是昂贵的。为了克服这一挑战,我们提议利用大量可用的单数据来帮助生成声波听音。我们的主要观察是,视觉显示的音频分离任务也将独立音频映射到相应的视觉位置,这些音频与声波声生成具有相似的目标。我们将立声和源分离都纳入一个统一的框架,即Sep-Stereo,将源分离视为一种特殊的音频空间化类型。具体地说,一个新型的联动金字塔网络结构是为视听特征融合而精心设计的。广泛的实验表明,我们的框架可以改进声波生成结果,同时与共同的骨脊进行精确音断分离。

0

相关内容

Sep-Stereo

一图搞定ML！2020版机器学习技术路线图，35页ppt

一图搞定ML！2020版机器学习技术路线图，35页ppt

专知会员服务

94+阅读 · 2020年7月28日

商业数据分析，39页ppt

商业数据分析，39页ppt

专知会员服务

165+阅读 · 2020年6月2日

因果图，Causal Graphs，52页ppt

因果图，Causal Graphs，52页ppt

专知会员服务

253+阅读 · 2020年4月19日

【Google Research】Wavesplit:通过说话者聚类实现端到端的语音分离，Wavesplit: End-to-End Speech Separation by Speaker Clustering

【Google Research】Wavesplit:通过说话者聚类实现端到端的语音分离，Wavesplit: End-to-End Speech Separation by Speaker Clustering

专知会员服务

19+阅读 · 2020年2月26日

【CVPR2020】用于细粒度动作识别的多模式域自适应，Multi-Modal Domain Adaptation for Fine-Grained Action Recognition

【CVPR2020】用于细粒度动作识别的多模式域自适应，Multi-Modal Domain Adaptation for Fine-Grained Action Recognition

专知会员服务

78+阅读 · 2020年2月25日

【斯坦福大学】具有共同注意力的对抗性跨域动作识别（Adversarial Cross-Domain Action Recognition with Co-Attention）

【斯坦福大学】具有共同注意力的对抗性跨域动作识别（Adversarial Cross-Domain Action Recognition with Co-Attention）

专知会员服务

38+阅读 · 2019年12月26日

【ACM Multimedia 2019 Tutorial】机器学习音频和多媒体数据的再现性和实验设计（Reproducibility and Experimental Design for Machine Learning on Audio and Multimedia Data），Gerald Friedland

【ACM Multimedia 2019 Tutorial】机器学习音频和多媒体数据的再现性和实验设计（Reproducibility and Experimental Design for Machine Learning on Audio and Multimedia Data），Gerald Friedland

专知会员服务

5+阅读 · 2019年11月18日

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

专知会员服务

49+阅读 · 2019年10月17日

Deep Learning Based Detection and Correction of Cardiac MR Motion Artefacts During Reconstruction for High-Quality Segmentation

Deep Learning Based Detection and Correction of Cardiac MR Motion Artefacts During Reconstruction for High-Quality Segmentation

专知会员服务

59+阅读 · 2019年10月17日

【加州大学伯克利分校博士论文】通过自我监督预测学习泛化

【加州大学伯克利分校博士论文】通过自我监督预测学习泛化

专知会员服务

65+阅读 · 2019年10月9日

灾难性遗忘问题新视角：迁移-干扰平衡

灾难性遗忘问题新视角：迁移-干扰平衡

CreateAMind

17+阅读 · 2019年7月6日

Hierarchically Structured Meta-learning

Hierarchically Structured Meta-learning

CreateAMind

27+阅读 · 2019年5月22日

Transferring Knowledge across Learning Processes

Transferring Knowledge across Learning Processes

CreateAMind

29+阅读 · 2019年5月18日

无监督元学习表示学习

无监督元学习表示学习

CreateAMind

27+阅读 · 2019年1月4日

Unsupervised Learning via Meta-Learning

Unsupervised Learning via Meta-Learning

CreateAMind

43+阅读 · 2019年1月3日

【泡泡一分钟】基于视频修复的时空转换网络

【泡泡一分钟】基于视频修复的时空转换网络

泡泡机器人SLAM

5+阅读 · 2018年12月30日

Capsule Networks解析

Capsule Networks解析

机器学习研究会

11+阅读 · 2017年11月12日

MoCoGAN 分解运动和内容的视频生成

MoCoGAN 分解运动和内容的视频生成

CreateAMind

18+阅读 · 2017年10月21日

【ICCV 2017论文集】计算机视觉顶级会议ICCV2017 Open Access Repository

【ICCV 2017论文集】计算机视觉顶级会议ICCV2017 Open Access Repository

专知

6+阅读 · 2017年10月14日

Auto-Encoding GAN

Auto-Encoding GAN

CreateAMind

7+阅读 · 2017年8月4日

Gated Channel Transformation for Visual Recognition

Arxiv

4+阅读 · 2020年3月27日

Visual Grounding in Video for Unsupervised Word Translation

Visual Grounding in Video for Unsupervised Word Translation

Arxiv

7+阅读 · 2020年3月11日

Generating Question Relevant Captions to Aid Visual Question Answering

Generating Question Relevant Captions to Aid Visual Question Answering

Arxiv

5+阅读 · 2019年9月9日

An End-to-End Baseline for Video Captioning

Arxiv

6+阅读 · 2019年4月4日

3D Hand Shape and Pose Estimation from a Single RGB Image

3D Hand Shape and Pose Estimation from a Single RGB Image

Arxiv

17+阅读 · 2019年3月3日

Improving Neural Question Generation using Answer Separation

Improving Neural Question Generation using Answer Separation

Arxiv

3+阅读 · 2018年9月7日

Learning to Generate and Reconstruct 3D Meshes with only 2D Supervision

Learning to Generate and Reconstruct 3D Meshes with only 2D Supervision

Arxiv

5+阅读 · 2018年7月24日

Reconstruction Network for Video Captioning

Arxiv

5+阅读 · 2018年3月30日

A Unified Method for First and Third Person Action Recognition

Arxiv

3+阅读 · 2017年12月30日

Integrating both Visual and Audio Cues for Enhanced Video Caption

Arxiv

4+阅读 · 2017年12月9日

VIP会员

文章信息

相关主题

相关VIP内容

一图搞定ML！2020版机器学习技术路线图，35页ppt

一图搞定ML！2020版机器学习技术路线图，35页ppt

专知会员服务

94+阅读 · 2020年7月28日

商业数据分析，39页ppt

商业数据分析，39页ppt

专知会员服务

165+阅读 · 2020年6月2日

因果图，Causal Graphs，52页ppt

因果图，Causal Graphs，52页ppt

专知会员服务

253+阅读 · 2020年4月19日

【Google Research】Wavesplit:通过说话者聚类实现端到端的语音分离，Wavesplit: End-to-End Speech Separation by Speaker Clustering

【Google Research】Wavesplit:通过说话者聚类实现端到端的语音分离，Wavesplit: End-to-End Speech Separation by Speaker Clustering

专知会员服务

19+阅读 · 2020年2月26日

【CVPR2020】用于细粒度动作识别的多模式域自适应，Multi-Modal Domain Adaptation for Fine-Grained Action Recognition

【CVPR2020】用于细粒度动作识别的多模式域自适应，Multi-Modal Domain Adaptation for Fine-Grained Action Recognition

专知会员服务

78+阅读 · 2020年2月25日

【斯坦福大学】具有共同注意力的对抗性跨域动作识别（Adversarial Cross-Domain Action Recognition with Co-Attention）

【斯坦福大学】具有共同注意力的对抗性跨域动作识别（Adversarial Cross-Domain Action Recognition with Co-Attention）

专知会员服务

38+阅读 · 2019年12月26日

【ACM Multimedia 2019 Tutorial】机器学习音频和多媒体数据的再现性和实验设计（Reproducibility and Experimental Design for Machine Learning on Audio and Multimedia Data），Gerald Friedland

【ACM Multimedia 2019 Tutorial】机器学习音频和多媒体数据的再现性和实验设计（Reproducibility and Experimental Design for Machine Learning on Audio and Multimedia Data），Gerald Friedland

专知会员服务

5+阅读 · 2019年11月18日

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

专知会员服务

49+阅读 · 2019年10月17日

Deep Learning Based Detection and Correction of Cardiac MR Motion Artefacts During Reconstruction for High-Quality Segmentation

Deep Learning Based Detection and Correction of Cardiac MR Motion Artefacts During Reconstruction for High-Quality Segmentation

专知会员服务

59+阅读 · 2019年10月17日

【加州大学伯克利分校博士论文】通过自我监督预测学习泛化

【加州大学伯克利分校博士论文】通过自我监督预测学习泛化

专知会员服务

65+阅读 · 2019年10月9日

热门VIP内容

开通专知VIP会员享更多权益服务

【AAAI2026】FinRpt：面向证券研究报告生成的数据集、评测体系与基于大语言模型的多智能体框架

美陆军加速采购百万架无人机与激光武器以应对无人机威胁

【CMU博士论文】利用人工智能实现自动化发现

智能的基础：从人类认知视角综述数学文字题研究

相关资讯

灾难性遗忘问题新视角：迁移-干扰平衡

灾难性遗忘问题新视角：迁移-干扰平衡

CreateAMind

17+阅读 · 2019年7月6日

Hierarchically Structured Meta-learning

Hierarchically Structured Meta-learning

CreateAMind

27+阅读 · 2019年5月22日

Transferring Knowledge across Learning Processes

Transferring Knowledge across Learning Processes

CreateAMind

29+阅读 · 2019年5月18日

无监督元学习表示学习

无监督元学习表示学习

CreateAMind

27+阅读 · 2019年1月4日

Unsupervised Learning via Meta-Learning

Unsupervised Learning via Meta-Learning

CreateAMind

43+阅读 · 2019年1月3日

【泡泡一分钟】基于视频修复的时空转换网络

【泡泡一分钟】基于视频修复的时空转换网络

泡泡机器人SLAM

5+阅读 · 2018年12月30日

Capsule Networks解析

Capsule Networks解析

机器学习研究会

11+阅读 · 2017年11月12日

MoCoGAN 分解运动和内容的视频生成

MoCoGAN 分解运动和内容的视频生成

CreateAMind

18+阅读 · 2017年10月21日

【ICCV 2017论文集】计算机视觉顶级会议ICCV2017 Open Access Repository

【ICCV 2017论文集】计算机视觉顶级会议ICCV2017 Open Access Repository

专知

6+阅读 · 2017年10月14日

Auto-Encoding GAN

Auto-Encoding GAN

CreateAMind

7+阅读 · 2017年8月4日

相关论文

Gated Channel Transformation for Visual Recognition

Arxiv

4+阅读 · 2020年3月27日

Visual Grounding in Video for Unsupervised Word Translation

Visual Grounding in Video for Unsupervised Word Translation

Arxiv

7+阅读 · 2020年3月11日

Generating Question Relevant Captions to Aid Visual Question Answering

Generating Question Relevant Captions to Aid Visual Question Answering

Arxiv

5+阅读 · 2019年9月9日

An End-to-End Baseline for Video Captioning

Arxiv

6+阅读 · 2019年4月4日

3D Hand Shape and Pose Estimation from a Single RGB Image

3D Hand Shape and Pose Estimation from a Single RGB Image

Arxiv

17+阅读 · 2019年3月3日

Improving Neural Question Generation using Answer Separation

Improving Neural Question Generation using Answer Separation

Arxiv

3+阅读 · 2018年9月7日

Learning to Generate and Reconstruct 3D Meshes with only 2D Supervision

Learning to Generate and Reconstruct 3D Meshes with only 2D Supervision

Arxiv

5+阅读 · 2018年7月24日

Reconstruction Network for Video Captioning

Arxiv

5+阅读 · 2018年3月30日

A Unified Method for First and Third Person Action Recognition

Arxiv

3+阅读 · 2017年12月30日

Integrating both Visual and Audio Cues for Enhanced Video Caption

Arxiv

4+阅读 · 2017年12月9日

微信扫码咨询专知VIP会员