Recent developments in speech emotion recognition (SER) often leverage deep neural networks (DNNs). Comparing and benchmarking different DNN models is often tedious due to the use of different datasets and evaluation protocols. To facilitate the process, we present the Speech Emotion Recognition Adaptation Benchmark (SERAB), a framework for evaluating the performance and generalization capacity of different approaches to utterance-level SER. The benchmark comprises nine SER datasets in six languages. Since the datasets differ in size and in the number of emotional classes, the proposed setup is particularly suitable for estimating the generalization capacity of pre-trained DNN-based feature extractors. We used the proposed framework to evaluate a selection of standard hand-crafted feature sets and state-of-the-art DNN representations. The results highlight that using only a subset of the data included in SERAB can yield biased evaluations, while compliance with the proposed protocol circumvents this issue.
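The evaluation protocol described above can be illustrated with a minimal sketch. Everything here is hypothetical and not the actual SERAB API: the dataset names, sizes, and the `extract_features` placeholder (which stands in for a frozen, pre-trained DNN feature extractor) are invented for illustration. The key idea shown is that a lightweight classifier is trained and scored per dataset on fixed utterance-level embeddings, and the benchmark score aggregates over all datasets rather than a single corpus.

```python
# Hypothetical sketch of a SERAB-style evaluation loop; names and
# datasets are illustrative placeholders, not the real benchmark.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

def extract_features(utterances):
    # Placeholder for a frozen, pre-trained DNN embedding
    # (e.g. one 512-dim vector per utterance).
    return rng.normal(size=(len(utterances), 512))

# Stand-ins for the benchmark corpora: (num utterances, num emotion classes).
# Varying sizes and class counts mimic the heterogeneity of the real datasets.
datasets = {"corpus_a": (200, 4), "corpus_b": (150, 6), "corpus_c": (300, 5)}

scores = {}
for name, (n, n_classes) in datasets.items():
    X = extract_features(range(n))               # fixed features, no fine-tuning
    y = rng.integers(0, n_classes, size=n)       # dummy emotion labels
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, random_state=0
    )
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    scores[name] = accuracy_score(y_te, clf.predict(X_te))

# Aggregate across ALL datasets; reporting only a favorable subset
# is exactly the biased evaluation the benchmark aims to prevent.
benchmark_score = float(np.mean(list(scores.values())))
print(benchmark_score)
```

In this sketch, swapping in a different feature extractor only requires replacing `extract_features`; the per-dataset classifiers and the aggregation stay fixed, which is what makes the comparison across extractors fair.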