Recognition of signers' emotions faces one theoretical challenge and one practical challenge: the overlap between grammatical and affective facial expressions, and the scarcity of data for model training. This paper addresses both challenges in a cross-lingual setting using our eJSL dataset, a new benchmark for emotion recognition in Japanese Sign Language signers, and BOBSL, a large British Sign Language dataset with subtitles. In eJSL, two signers expressed 78 distinct utterances in each of seven emotional states, yielding 1,092 video clips. We empirically demonstrate that 1) textual emotion recognition in spoken language mitigates data scarcity in sign language, 2) temporal segment selection has a significant impact, and 3) incorporating hand motion enhances emotion recognition in signers. Finally, we establish a stronger baseline than spoken-language LLMs.