Target speaker extraction focuses on isolating a specific speaker's voice from an audio mixture containing multiple speakers. To provide information about the target speaker's identity, prior works have utilized clean audio samples as conditioning inputs. However, such clean audio examples are not always readily available. For instance, obtaining a clean recording of a stranger's voice at a cocktail party without leaving the noisy environment is generally infeasible. Limited prior research has explored extracting the target speaker's characteristics from noisy enrollments, which may contain overlapping speech from interfering speakers. In this work, we explore a novel enrollment strategy that encodes target speaker information from the noisy enrollment by comparing segments where the target speaker is talking (Positive Enrollments) with segments where the target speaker is silent (Negative Enrollments). Experiments show the effectiveness of our model architecture, which achieves over 2.1 dB higher SI-SNRi than prior works when extracting monaural speech from two-speaker mixtures. Additionally, the proposed two-stage training strategy accelerates convergence, reducing the number of optimization steps required to reach 3 dB SNR by 60%. Overall, our method achieves state-of-the-art performance in monaural target speaker extraction conditioned on noisy enrollments. Our implementation is available at https://github.com/xu-shitong/TSE-through-Positive-Negative-Enroll.
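For reference, the SI-SNRi figure quoted above measures the improvement in scale-invariant signal-to-noise ratio of the extracted signal over the unprocessed mixture. The sketch below uses the standard SI-SNR definition and is an illustrative example, not code from the linked repository.

```python
import numpy as np


def si_snr(estimate: np.ndarray, target: np.ndarray, eps: float = 1e-8) -> float:
    """Scale-invariant signal-to-noise ratio (SI-SNR) in dB."""
    # Remove the mean so the measure is invariant to DC offsets.
    estimate = estimate - estimate.mean()
    target = target - target.mean()
    # Project the estimate onto the target to obtain the scaled reference.
    s_target = np.dot(estimate, target) * target / (np.dot(target, target) + eps)
    e_noise = estimate - s_target
    return 10 * np.log10((np.dot(s_target, s_target) + eps) / (np.dot(e_noise, e_noise) + eps))


def si_snr_improvement(estimate: np.ndarray, mixture: np.ndarray, target: np.ndarray) -> float:
    """SI-SNRi: SI-SNR of the extracted speech minus SI-SNR of the input mixture."""
    return si_snr(estimate, target) - si_snr(mixture, target)
```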