Batch effects pose a significant challenge in the analysis of single-cell omics data because they introduce technical artifacts that confound biological signals. Although various computational methods have achieved empirical success in correcting these effects, they lack the formal theoretical guarantees needed to assess their reliability and generalization. To bridge this gap, we introduce Mixture-Model-based Data Harmonization (MoDaH), a principled batch correction algorithm grounded in a rigorous statistical framework. Under a new Gaussian mixture model with an explicit parametrization of batch effects, we establish the minimax optimal error rate for batch correction and prove that MoDaH achieves this rate by leveraging recent theoretical advances in clustering data from anisotropic Gaussian mixtures. To the best of our knowledge, this constitutes the first theoretical guarantee for batch correction. Extensive experiments on diverse single-cell RNA-seq and spatial proteomics datasets demonstrate that, in addition to being theoretically optimal, MoDaH achieves empirical performance comparable to, or even surpassing, that of state-of-the-art heuristics (e.g., Harmony, Seurat-V5, and LIGER), effectively balancing the removal of technical noise with the conservation of biological signal.
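The general idea of mixture-model-based batch correction (cluster the pooled cells, then align each cluster across batches) can be illustrated with a toy sketch. This is not the MoDaH algorithm, whose details are not given in the abstract. It assumes one-dimensional expression values, two cell types, two batches, and well-separated clusters, and it swaps in plain 2-means clustering for a fitted Gaussian mixture. All function names, centers, and shift values are hypothetical and chosen only for illustration.

```python
import random
from statistics import fmean


def simulate(n_per_group=200, seed=0):
    """Toy data: 2 cell types x 2 batches, 1-D values.

    Batch 1 adds a cell-type-specific shift, playing the role of a
    batch effect. All numbers here are hypothetical.
    """
    rng = random.Random(seed)
    centers = [0.0, 10.0]                      # cell-type means
    shifts = {0: [0.0, 0.0], 1: [1.5, -2.0]}   # batch -> shift per cell type
    cells = []
    for batch in (0, 1):
        for cell_type, mu in enumerate(centers):
            for _ in range(n_per_group):
                x = rng.gauss(mu + shifts[batch][cell_type], 1.0)
                cells.append((x, batch, cell_type))
    return cells


def two_means(values, iters=20):
    """Plain 1-D 2-means (Lloyd's algorithm) on the pooled values.

    Stand-in for fitting a mixture model; returns a 0/1 label per value.
    """
    centers = [min(values), max(values)]
    labels = [0] * len(values)
    for _ in range(iters):
        labels = [
            0 if abs(v - centers[0]) <= abs(v - centers[1]) else 1
            for v in values
        ]
        for k in (0, 1):
            members = [v for v, lab in zip(values, labels) if lab == k]
            if members:
                centers[k] = fmean(members)
    return labels


def correct(cells, reference_batch=0):
    """Shift each (cluster, batch) group so its mean matches the
    same cluster's mean in the reference batch."""
    values = [x for x, _, _ in cells]
    batches = [b for _, b, _ in cells]
    labels = two_means(values)

    # Mean of every (cluster, batch) group; assumes no group is empty,
    # which holds for this well-separated toy data.
    means = {}
    for k in set(labels):
        for b in set(batches):
            group = [
                v for v, lab, bb in zip(values, labels, batches)
                if lab == k and bb == b
            ]
            means[(k, b)] = fmean(group)

    corrected = [
        v - (means[(lab, b)] - means[(lab, reference_batch)])
        for v, lab, b in zip(values, labels, batches)
    ]
    return corrected, labels


if __name__ == "__main__":
    cells = simulate()
    corrected, labels = correct(cells)
    print(f"corrected {len(corrected)} cells in {len(set(labels))} clusters")
```

The sketch corrects only a mean shift per cluster and batch. A fuller treatment along the lines described in the abstract would also need to handle anisotropic covariances and clustering error, which this toy example ignores.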