Mixture models posit that the overall population is a mixture of a finite number of subpopulations with unobserved membership. Fitting mixture models usually requires large sample sizes, so combining data from multiple sites can be beneficial. However, sharing individual participant data across sites is often infeasible due to practical constraints such as data privacy concerns. Moreover, substantial heterogeneity may exist across sites, and latent classes identified locally at each site may not be comparable across sites. We propose a unified modeling framework in which a common definition of the latent classes is shared across sites, while site-specific mixing proportions are allowed to account for between-site heterogeneity. To fit the heterogeneous mixture model to multi-site data, we propose a novel distributed Expectation-Maximization (EM) algorithm in which, at each iteration, a density-ratio-tilted surrogate Q function is constructed to approximate the standard Q function of the EM algorithm as if the data from all sites could be pooled. Theoretical analysis shows that our estimator achieves the same contraction property as the estimator derived from the EM algorithm applied to the pooled data.
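To make the modeling framework concrete, the following is a minimal sketch of a heterogeneous mixture model with shared latent classes and site-specific mixing proportions; the notation ($K$ sites, $J$ latent classes, component densities $\phi(\cdot \mid \theta_j)$, mixing proportions $\pi_{kj}$) is assumed here for illustration and is not taken from the paper.
\[
f_k(x) \;=\; \sum_{j=1}^{J} \pi_{kj}\, \phi(x \mid \theta_j), \qquad \sum_{j=1}^{J} \pi_{kj} = 1, \qquad k = 1, \ldots, K,
\]
where the latent-class parameters $\theta_1, \ldots, \theta_J$ are common to all $K$ sites, while the mixing proportions $\pi_k = (\pi_{k1}, \ldots, \pi_{kJ})$ are allowed to differ across sites to capture between-site heterogeneity.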
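For orientation, and again under the assumed notation above rather than the paper's, the pooled-data Q function that the surrogate is constructed to approximate has the standard EM form at iteration $t$:
\[
Q\bigl(\theta, \pi \mid \theta^{(t)}, \pi^{(t)}\bigr) \;=\; \sum_{k=1}^{K} \sum_{i=1}^{n_k} \sum_{j=1}^{J} w_{kij}^{(t)} \bigl\{ \log \pi_{kj} + \log \phi(x_{ki} \mid \theta_j) \bigr\},
\qquad
w_{kij}^{(t)} \;=\; \frac{\pi_{kj}^{(t)}\, \phi\bigl(x_{ki} \mid \theta_j^{(t)}\bigr)}{\sum_{l=1}^{J} \pi_{kl}^{(t)}\, \phi\bigl(x_{ki} \mid \theta_l^{(t)}\bigr)},
\]
where $x_{ki}$ denotes observation $i$ at site $k$ and $w_{kij}^{(t)}$ is the posterior class-membership probability. The specific density-ratio tilting that allows this pooled quantity to be approximated from site-local summaries is the paper's contribution and is not reproduced in this sketch.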