Demand for mental health support through AI chatbots is surging, yet current systems have several limitations, such as sycophancy, overvalidation, and reinforcement of maladaptive beliefs. A core obstacle to building better systems is the scarcity of benchmarks that capture the complexity of real therapeutic interactions. Most existing benchmarks either test only clinical knowledge through multiple-choice questions or assess single responses in isolation. To bridge this gap, we present MindEval, a framework designed in collaboration with Ph.D.-level Licensed Clinical Psychologists for automatically evaluating language models in realistic, multi-turn mental health therapy conversations. Through patient simulation and automatic evaluation with LLMs, our framework balances resistance to gaming with reproducibility via its fully automated, model-agnostic design. We begin by quantitatively validating the realism of our simulated patients against human-generated text and by demonstrating strong correlations between automatic and human expert judgments. Then, we evaluate 12 state-of-the-art LLMs and show that all models struggle, scoring below 4 out of 6 on average, with particular weaknesses in problematic AI-specific patterns of communication. Notably, reasoning capabilities and model scale do not guarantee better performance, and systems deteriorate with longer interactions or when supporting patients with severe symptoms. We release all code, prompts, and human evaluation data.


