MULTI-Bench: A Multi-Turn Interactive Benchmark for Assessing Emotional Intelligence ability of Spoken Dialogue Models

Spoken Dialogue Models (SDMs) have advanced rapidly, yet their ability to sustain genuinely interactive multi-turn conversations remains underexplored, as most benchmarks focus on single-turn exchanges. We introduce Multi-Bench, the first benchmark explicitly designed to evaluate SDMs in multi-turn interactive dialogue with an emphasis on emotional intelligence. Multi-Bench employs a hierarchical structure with a basic track for emotion understanding and reasoning and an advanced track for emotion support and application. It comprises five carefully designed tasks and about 3.2K samples, ranging from emotion recognition to complex reasoning and interactive dialogue, supported by a reproducible evaluation framework. We evaluate six representative SDMs on eight subsets of Multi-Bench. Results show that while current SDMs achieve good performance on basic understanding tasks, they still have room for improvement in advanced multi-turn interactive dialogue and reasoning-related tasks, particularly in emotion awareness and application.
