Human voice encodes both identity and paralinguistic cues, yet encoders in large audio-language models (LALMs) rarely balance both aspects. In this work, we present a study toward building a general-purpose voice encoder that captures nuanced voice cues. Through a comprehensive evaluation, we find that multi-task training yields the most balanced representations, whereas contrastive language-audio pretraining (CLAP) primarily improves retrieval without enhancing paralinguistic understanding. Our final encoder, Auden-Voice, also demonstrates strong performance when integrated with LLMs. The code and training recipes will be released with the audio understanding toolkit Auden.