Our ability to interpret others' mental states through nonverbal cues (NVCs) is fundamental to our survival and social cohesion. While existing Theory of Mind (ToM) benchmarks have primarily focused on false-belief tasks and reasoning with asymmetric information, they overlook mental states beyond belief and the rich tapestry of human nonverbal communication. We present Motion2Mind, a framework for evaluating the ToM capabilities of machines in interpreting NVCs. Leveraging an expert-curated body-language reference as a proxy knowledge base, we build Motion2Mind, a carefully curated video dataset with fine-grained nonverbal-cue annotations paired with manually verified psychological interpretations. It encompasses 222 types of nonverbal cues and 397 mental states. Our evaluation reveals that current AI systems struggle significantly with NVC interpretation, exhibiting not only a substantial performance gap in Detection but also patterns of over-interpretation in Explanation compared to human annotators.