Large-scale audio language models (ALMs), such as Qwen2-Audio, are capable of comprehending diverse audio signals, performing audio analysis, and generating textual responses. However, in speech emotion recognition (SER), ALMs often suffer from hallucinations, resulting in misclassifications or irrelevant outputs. To address these challenges, we propose C$^2$SER, a novel ALM designed to enhance the stability and accuracy of SER through Contextual perception and Chain of Thought (CoT). C$^2$SER integrates the Whisper encoder for semantic perception and Emotion2Vec-S for acoustic perception, where Emotion2Vec-S extends Emotion2Vec with semi-supervised learning to strengthen emotional discrimination. Additionally, C$^2$SER employs a CoT approach, performing SER step by step while leveraging speech content and speaking style to improve recognition. To further enhance stability, C$^2$SER introduces self-distillation from explicit CoT to implicit CoT, mitigating error accumulation and boosting recognition accuracy. Extensive experiments show that C$^2$SER outperforms popular existing ALMs, such as Qwen2-Audio and SECap, delivering more stable and precise emotion recognition. We release our training code, checkpoints, and test sets to facilitate further research.
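To make the contextual-perception design concrete, the sketch below shows one plausible way the two branches could be fused, assuming simple linear adapters and time-axis concatenation; the module names, dimensions (1280 for the Whisper large encoder, 768 for Emotion2Vec-S, 2048 for the LM), and fusion strategy are illustrative assumptions, not the released implementation.

```python
# Hypothetical sketch of C^2SER's dual-branch perception front end.
# Dimensions, names, and the concatenation-based fusion are assumptions.
import torch
import torch.nn as nn

class ContextualPerception(nn.Module):
    """Fuses semantic features (Whisper encoder) with acoustic emotion
    features (Emotion2Vec-S) into one token sequence for the LM."""

    def __init__(self, semantic_dim=1280, acoustic_dim=768, llm_dim=2048):
        super().__init__()
        # Separate linear adapters project each branch into the LM space.
        self.semantic_proj = nn.Linear(semantic_dim, llm_dim)
        self.acoustic_proj = nn.Linear(acoustic_dim, llm_dim)

    def forward(self, whisper_feats, emotion2vec_feats):
        # whisper_feats:     (batch, T_s, semantic_dim) from the Whisper encoder
        # emotion2vec_feats: (batch, T_a, acoustic_dim) from Emotion2Vec-S
        sem = self.semantic_proj(whisper_feats)
        acc = self.acoustic_proj(emotion2vec_feats)
        # Concatenate along the time axis so the LM attends to both views
        # (speech content and emotional acoustics) when decoding the CoT.
        return torch.cat([sem, acc], dim=1)

# Usage: fused tokens would be prepended to the text prompt embeddings
# before autoregressive decoding of the CoT response.
fusion = ContextualPerception()
tokens = fusion(torch.randn(2, 100, 1280), torch.randn(2, 50, 768))
print(tokens.shape)  # torch.Size([2, 150, 2048])
```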
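The self-distillation step can likewise be sketched as a standard teacher-student objective, assuming the explicit-CoT pass supplies soft targets for the implicit-CoT pass over emotion-label logits; the loss form, weighting `alpha`, and temperature `tau` here are assumptions, not the paper's exact formulation.

```python
# Hypothetical sketch of self-distillation from explicit to implicit CoT.
# The KL-on-logits form, alpha, and tau are assumptions for illustration.
import torch
import torch.nn.functional as F

def self_distillation_loss(student_logits, teacher_logits, labels,
                           alpha=0.5, tau=2.0):
    """student_logits: implicit-CoT pass (emotion predicted directly).
    teacher_logits: explicit-CoT pass (emotion predicted after the full
    step-by-step chain), detached so gradients update only the student."""
    # Standard cross-entropy against the ground-truth emotion label.
    ce = F.cross_entropy(student_logits, labels)
    # KL divergence pulls the implicit pass toward the explicit pass,
    # transferring CoT behaviour without emitting the chain at inference.
    kd = F.kl_div(
        F.log_softmax(student_logits / tau, dim=-1),
        F.softmax(teacher_logits.detach() / tau, dim=-1),
        reduction="batchmean",
    ) * (tau * tau)
    return (1 - alpha) * ce + alpha * kd

# Usage with 4 emotion classes and a batch of 2 (shapes are illustrative).
loss = self_distillation_loss(torch.randn(2, 4), torch.randn(2, 4),
                              torch.tensor([0, 3]))
print(loss.item())
```

Detaching the teacher logits is the conventional choice in such setups: the explicit-CoT pass acts purely as a target, so the distillation signal cannot degrade the reasoning path it is meant to preserve.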