From Medical Records to Diagnostic Dialogues: A Clinical-Grounded Approach and Dataset for Psychiatric Comorbidity

Psychiatric comorbidity is clinically significant yet challenging due to the complexity of multiple co-occurring disorders. To address this, we develop a novel approach integrating synthetic patient electronic medical record (EMR) construction and multi-agent diagnostic dialogue generation. We create 502 synthetic EMRs for common comorbid conditions using a pipeline that ensures clinical relevance and diversity. Our multi-agent framework transfers the clinical interview protocol into a hierarchical state machine and context tree, supporting over 130 diagnostic states while maintaining clinical standards. Through this rigorous process, we construct PsyCoTalk, the first large-scale dialogue dataset supporting comorbidity, containing 3,000 multi-turn diagnostic dialogues validated by psychiatrists. This dataset enhances diagnostic accuracy and treatment planning, offering a valuable resource for psychiatric comorbidity research. Compared to real-world clinical transcripts, PsyCoTalk exhibits high structural and linguistic fidelity in terms of dialogue length, token distribution, and diagnostic reasoning strategies. Licensed psychiatrists confirm the realism and diagnostic validity of the dialogues. This dataset enables the development and evaluation of models capable of multi-disorder psychiatric screening in a single conversational pass.
