It Hears, It Sees too: Multi-Modal LLM for Depression Detection By Integrating Visual Understanding into Audio Language Models

Depression is one of the most prevalent mental health disorders globally. In recent years, multi-modal data, such as speech, video, and transcripts, has been increasingly used to develop AI-assisted depression assessment systems. Large language models (LLMs) have further advanced this field due to their strong language understanding and generalization capabilities. However, conventional LLMs remain text-centric and cannot process the rich non-verbal cues found in audio and visual modalities, which are critical components in mental health evaluation. While multi-modal LLMs offer a promising direction, few are tailored for psychological applications. In this study, we propose a novel multi-modal LLM framework for depression detection. Our approach augments an audio language model with visual understanding and aligns audio-visual features at the timestamp level. This fine-grained alignment improves modeling of temporal dynamics across modalities while reducing the need for extensive training data and computational resources. Experiments on the DAIC-WoZ dataset demonstrate that our model outperforms both single-modality approaches and previous multi-modal methods. Moreover, the proposed framework can be extended to incorporate additional physiological signals, paving the way for broader clinical applications beyond mental health.
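The abstract does not spell out how the timestamp-level alignment is computed. As a rough illustration only, the sketch below assumes one simple reading: each audio frame is paired with the video frame closest to it in time, and the two feature vectors are concatenated. The function name `align_by_timestamp`, the feature dimensions, and the nearest-neighbour rule are all hypothetical choices for this example and are not taken from the paper.

```python
import numpy as np

def align_by_timestamp(audio_feats, audio_times, video_feats, video_times):
    """Pair each audio frame with the video frame nearest in time.

    Hypothetical helper, not the paper's method. Assumes both timestamp
    arrays are sorted in seconds and that there are at least two video frames.
    Returns the fused features and the chosen video index per audio frame.
    """
    audio_times = np.asarray(audio_times, dtype=float)
    video_times = np.asarray(video_times, dtype=float)

    # Index of the first video timestamp >= each audio timestamp.
    right = np.searchsorted(video_times, audio_times)
    right = np.clip(right, 1, len(video_times) - 1)
    left = right - 1

    # Keep whichever neighbour is closer in time (ties go to the earlier frame).
    use_left = (audio_times - video_times[left]) <= (video_times[right] - audio_times)
    nearest = np.where(use_left, left, right)

    # Concatenate along the feature axis: one fused vector per audio frame.
    fused = np.concatenate([audio_feats, video_feats[nearest]], axis=1)
    return fused, nearest

# Toy usage: 4 audio frames every 0.5 s, 3 video frames every 1.0 s.
audio_feats = np.arange(12, dtype=float).reshape(4, 3)
video_feats = np.arange(6, dtype=float).reshape(3, 2)
fused, nearest = align_by_timestamp(
    audio_feats, [0.0, 0.5, 1.0, 1.5], video_feats, [0.0, 1.0, 2.0]
)
print(fused.shape)  # (4, 5)
```

Nearest-frame matching keeps the fused sequence at the audio frame rate, which is one plausible way to let an audio language model consume visual features without changing its sequence length. Windowed pooling or learned cross-attention over timestamps would be equally reasonable alternatives.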
