Everyday speech conveys far more than words: it reflects who we are, how we feel, and the circumstances surrounding our interactions. Yet most existing speech datasets are acted, limited in scale, and fail to capture the expressive richness of real-life communication. With the rise of large neural networks, several large-scale speech corpora have emerged and been widely adopted across speech processing tasks. However, the field of voice conversion (VC) still lacks large-scale, expressive, real-life speech resources suitable for modeling natural prosody and emotion. To fill this gap, we release NaturalVoices (NV), the first large-scale spontaneous podcast dataset specifically designed for emotion-aware voice conversion. It comprises 5,049 hours of spontaneous podcast recordings with automatic annotations for emotion (categorical and attribute-based), speech quality, transcripts, speaker identity, and sound events. The dataset captures expressive emotional variation across thousands of speakers, diverse topics, and natural speaking styles. We also provide an open-source pipeline with modular annotation tools and flexible filtering, enabling researchers to construct customized subsets for a wide range of VC tasks. Experiments demonstrate that NaturalVoices supports the development of robust and generalizable VC models capable of producing natural, expressive speech, while revealing limitations of current architectures when applied to large-scale spontaneous data. These results suggest that NaturalVoices is both a valuable resource and a challenging benchmark for advancing the field of voice conversion. The dataset is available at: https://huggingface.co/JHU-SmileLab