Demand for mental health support through AI chatbots is surging, yet current systems exhibit serious limitations, such as sycophancy, overvalidation, and reinforcement of maladaptive beliefs. A core obstacle to building better systems is the scarcity of benchmarks that capture the complexity of real therapeutic interactions: most existing benchmarks either test only clinical knowledge through multiple-choice questions or assess single responses in isolation. To bridge this gap, we present MindEval, a framework designed in collaboration with Ph.D.-level Licensed Clinical Psychologists for automatically evaluating language models in realistic, multi-turn mental health therapy conversations. Through patient simulation and automatic evaluation with LLMs, our framework balances resistance to gaming with reproducibility via its fully automated, model-agnostic design. We first quantitatively validate the realism of our simulated patients against human-generated text and demonstrate strong correlations between automatic and human expert judgments. We then evaluate 12 state-of-the-art LLMs and show that all models struggle, scoring below 4 out of 6 on average, with particular weaknesses in problematic AI-specific patterns of communication. Notably, neither reasoning capabilities nor model scale guarantees better performance, and systems deteriorate with longer interactions or when supporting patients with severe symptoms. We release all code, prompts, and human evaluation data.