Modern data often arrive in multiple modalities. For example, covariates and a network may be observed on the same subjects, and both contain useful information. Effectively integrating these modalities is important and challenging, especially when the response is unavailable. We study the fundamental covariate selection problem for high-dimensional data by leveraging network information. We propose the Network-Guided Covariate Selection (NGCS) algorithm. NGCS exploits the spectral structure of the network to construct a network-guided screening statistic, and employs data-driven Higher Criticism Thresholding for covariate recovery. We establish consistency guarantees for NGCS under general networks. In particular, under two commonly used network models, we relate the projected signal strength to the individual signal strength, and demonstrate that NGCS is optimal for covariate selection and can achieve the same rate as supervised learning. We further consider a two-study setting for downstream applications, where the network is observed only in Study 1. For clustering and regression, we propose the NG-clu and NG-reg algorithms. NG-clu accurately clusters all subjects, while NG-reg improves prediction by using the post-selection covariate matrix. Experiments on synthetic and real datasets demonstrate the robustness and superior performance of our algorithms across various network models, noise distributions, and signal strengths.
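The screening-and-thresholding pipeline summarized in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give the exact form of the statistic, so the sketch assumes that each covariate is projected onto the K leading eigenvectors of the adjacency matrix, that the noise is standard Gaussian (so the squared projection norm is chi-square with K degrees of freedom under the null), and that thresholding uses a standard Higher Criticism objective. The names `chi2_sf`, `network_guided_stats`, and `hc_select`, and all simulation parameters, are hypothetical.

```python
import math
import numpy as np

def chi2_sf(x, df):
    """Chi-square survival function for integer df, via the recurrence
    Q(a + 1, h) = Q(a, h) + h**a * exp(-h) / Gamma(a + 1), with h = x / 2."""
    half = np.asarray(x, dtype=float) / 2.0
    if df % 2 == 0:
        q, k = np.exp(-half), 2                             # base case df = 2
    else:
        q, k = np.vectorize(math.erfc)(np.sqrt(half)), 1    # base case df = 1
    while k < df:
        q = q + half ** (k / 2) * np.exp(-half) / math.gamma(k / 2 + 1)
        k += 2
    return q

def network_guided_stats(A, X, K):
    """Assumed statistic: squared norm of each covariate column after
    projection onto the K eigenvectors of A with largest |eigenvalue|."""
    vals, vecs = np.linalg.eigh(A)
    top = np.argsort(-np.abs(vals))[:K]
    proj = vecs[:, top].T @ X            # K x p matrix of projections
    return (proj ** 2).sum(axis=0)

def hc_select(pvals, alpha0=0.5):
    """Higher Criticism thresholding: keep the k_hat smallest p-values,
    where k_hat maximizes the HC objective over the first alpha0 * p ranks."""
    p = pvals.size
    order = np.argsort(pvals)
    k = np.arange(1, int(alpha0 * p) + 1)
    frac = k / p
    hc = np.sqrt(p) * (frac - pvals[order][: k.size]) / np.sqrt(frac * (1 - frac))
    k_hat = k[np.argmax(hc)]
    return np.sort(order[:k_hat])

# Toy data (all values are illustrative assumptions): a two-community
# stochastic block model, with the first s covariates carrying a
# community-dependent mean shift and the rest pure noise.
rng = np.random.default_rng(0)
n, p, K, s, mu = 300, 500, 2, 10, 0.5
labels = np.repeat([0, 1], n // 2)
P = np.where(labels[:, None] == labels[None, :], 0.30, 0.05)
upper = np.triu(rng.random((n, n)) < P, k=1)
A = (upper + upper.T).astype(float)      # symmetric, no self-loops
X = rng.standard_normal((n, p))
X[:, :s] += mu * np.where(labels == 0, 1.0, -1.0)[:, None]

t = network_guided_stats(A, X, K)
selected = hc_select(chi2_sf(t, K))
print("selected covariates:", selected)
```

Note that no response variable enters the sketch: the network alone guides the screening. Higher Criticism thresholding is data-driven and may admit a few null covariates alongside the true signals, so the selected set should be read as a superset candidate list rather than an exact recovery.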