High-dimensional Electronic Health Record (EHR) data hold substantial potential for healthcare research, but analyzing them effectively presents notable methodological challenges. Predictive modeling guided by a knowledge graph (KG), which enables efficient feature selection, can improve both statistical efficiency and interpretability. Although various methods for constructing KGs have emerged, existing techniques often lack a measure of statistical certainty about the presence of links between entities, especially when the use of patient-level EHR data is limited by privacy concerns. In this paper, we propose the first inferential framework for deriving a sparse KG with statistical guarantees, based on the dynamic log-linear topic model proposed by \cite{arora2016latent}. Within this model, the KG embeddings are estimated by performing singular value decomposition on the empirical pointwise mutual information matrix, which offers a scalable solution. We then establish entrywise asymptotic normality for the low-rank KG estimator, enabling the recovery of sparse graph edges with controlled type I error. Our work addresses an under-explored problem, statistical inference for non-linear statistics under low-rank, temporally dependent models, which is a critical gap in existing research. We validate our approach through extensive simulation studies and then apply the method to real-world EHR data to construct clinical KGs and generate clinical feature embeddings.
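As a rough illustration of the embedding step mentioned above (singular value decomposition of an empirical pointwise mutual information matrix), the following minimal sketch may help. It is not the paper's estimator or its inference procedure: the function names `empirical_pmi` and `low_rank_embeddings`, the convention of setting the PMI of unobserved pairs to zero, the symmetrization step, and the square-root scaling of the singular values are all assumptions made for this example. The input is assumed to be a square matrix of summary-level co-occurrence counts between clinical codes.

```python
import numpy as np


def empirical_pmi(cooc):
    """Empirical PMI from a square co-occurrence count matrix (hypothetical helper).

    PMI(i, j) = log( p(i, j) / (p(i) * p(j)) ), with probabilities estimated
    from the counts. Entries with zero co-occurrence are set to 0, which is
    an assumption of this sketch rather than something the source specifies.
    """
    cooc = np.asarray(cooc, dtype=float)
    total = cooc.sum()
    row = cooc.sum(axis=1, keepdims=True)   # marginal counts, shape (d, 1)
    col = cooc.sum(axis=0, keepdims=True)   # marginal counts, shape (1, d)
    expected = row @ col / total            # counts expected under independence
    pmi = np.zeros_like(cooc)
    mask = cooc > 0                         # a positive count implies expected > 0
    pmi[mask] = np.log(cooc[mask] / expected[mask])
    return pmi


def low_rank_embeddings(pmi, rank):
    """Rank-`rank` truncated SVD of the symmetrized PMI matrix (hypothetical helper).

    Returns (embeddings, low_rank_pmi). Scaling the left singular vectors by
    the square root of the singular values is one common convention, assumed here.
    """
    sym = (pmi + pmi.T) / 2.0               # enforce symmetry before factorizing
    u, s, vt = np.linalg.svd(sym)
    embeddings = u[:, :rank] * np.sqrt(s[:rank])
    low_rank_pmi = (u[:, :rank] * s[:rank]) @ vt[:rank, :]
    return embeddings, low_rank_pmi


# Toy symmetric co-occurrence counts for four hypothetical clinical codes.
toy_counts = np.array(
    [
        [50, 20, 5, 1],
        [20, 40, 4, 2],
        [5, 4, 30, 15],
        [1, 2, 15, 25],
    ]
)
toy_pmi = empirical_pmi(toy_counts)
toy_embeddings, toy_low_rank = low_rank_embeddings(toy_pmi, rank=2)
```

In this sketch, `toy_embeddings` has one row per code, and `toy_low_rank` is a rank-2 approximation of the PMI matrix. The entrywise asymptotic normality result described in the abstract would be what justifies testing individual entries of such a low-rank estimate to decide which edges to keep; that testing step requires the paper's variance estimates and is not attempted here.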