Large language models (LLMs) now mediate many web-based mental-health, crisis, and other emotionally sensitive services, yet their psychosocial safety in these settings remains poorly understood and weakly evaluated. We present DialogGuard, a multi-agent framework for assessing psychosocial risks in LLM-generated responses along five high-severity dimensions: privacy violations, discriminatory behaviour, mental manipulation, psychological harm, and insulting behaviour. DialogGuard can be applied to diverse generative models through four LLM-as-a-judge pipelines: single-agent scoring, dual-agent correction, multi-agent debate, and stochastic majority voting, all grounded in a shared three-level rubric usable by both human annotators and LLM judges. Using the PKU-SafeRLHF dataset with human safety annotations, we show that multi-agent mechanisms detect psychosocial risks more accurately than non-LLM baselines and single-agent judging; dual-agent correction and majority voting provide the best trade-off between accuracy, alignment with human ratings, and robustness, while debate attains higher recall but over-flags borderline cases. We release DialogGuard as open-source software with a web interface that provides per-dimension risk scores and explainable natural-language rationales. A formative study with 12 practitioners illustrates how it supports prompt design, auditing, and supervision of web-facing applications for vulnerable users.
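To make the judging pipelines concrete, the following is a minimal sketch of how a stochastic majority-voting judge over the five risk dimensions could look. All names (`DIMENSIONS`, `RUBRIC_LEVELS`, `majority_vote_judge`, `dummy_judge`) and the placeholder rubric labels are illustrative assumptions, not DialogGuard's actual prompts, rubric wording, or API.

```python
# Hypothetical sketch of a stochastic majority-voting judge; DialogGuard's real
# pipelines, prompts, and rubric wording are not reproduced here.
from collections import Counter
from typing import Callable, Dict, List

# Five psychosocial risk dimensions named in the abstract.
DIMENSIONS = [
    "privacy_violation",
    "discriminatory_behaviour",
    "mental_manipulation",
    "psychological_harm",
    "insulting_behaviour",
]

# Three-level rubric shared by human annotators and LLM judges
# (labels here are placeholders, not the paper's exact wording).
RUBRIC_LEVELS = {0: "no risk", 1: "borderline risk", 2: "high risk"}


def majority_vote_judge(
    response: str,
    judge: Callable[[str, str], int],  # judge(response, dimension) -> level in {0, 1, 2}
    n_votes: int = 5,
) -> Dict[str, int]:
    """Score one LLM response on every dimension by sampling the judge
    n_votes times per dimension and taking the most common level."""
    scores: Dict[str, int] = {}
    for dim in DIMENSIONS:
        votes: List[int] = [judge(response, dim) for _ in range(n_votes)]
        scores[dim] = Counter(votes).most_common(1)[0][0]
    return scores


if __name__ == "__main__":
    # Stand-in judge for demonstration; a real deployment would call an LLM
    # with the rubric embedded in its prompt and parse the returned level.
    def dummy_judge(response: str, dimension: str) -> int:
        return 2 if dimension == "insulting_behaviour" and "worthless" in response.lower() else 0

    print(majority_vote_judge("You are worthless and nobody cares.", dummy_judge))
```

The single-agent and dual-agent pipelines described in the abstract would differ mainly in how `judge` is composed (one scoring pass versus a scoring pass followed by a correcting pass), while the debate pipeline would exchange several rounds of arguments before a final level is emitted.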