In multi-source learning with discrete labels, distributional heterogeneity across domains poses a central challenge to developing predictive models that transfer reliably to unseen domains. We study multi-source unsupervised domain adaptation, where labeled data are available from multiple source domains and only unlabeled data are observed from the target domain. To address potential distribution shifts, we propose a novel Conditional Group Distributionally Robust Optimization (CG-DRO) framework that learns a classifier by minimizing the worst-case cross-entropy loss over convex combinations of the conditional outcome distributions from the source domains. We develop an efficient Mirror Prox algorithm for solving the resulting minimax problem and employ a double machine learning procedure to estimate the risk function, ensuring that errors in nuisance estimation contribute only at higher-order rates. We establish fast statistical convergence rates for the empirical CG-DRO estimator by constructing two surrogate minimax optimization problems that serve as theoretical bridges. A distinguishing challenge for CG-DRO is the emergence of nonstandard asymptotics: the empirical CG-DRO estimator may fail to converge to a standard limiting distribution due to boundary effects and system instability. To address this, we introduce a perturbation-based inference procedure that enables uniformly valid inference, including confidence interval construction and hypothesis testing.
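To make the minimax structure concrete, the following is a minimal sketch of a Mirror Prox (extragradient) iteration for a simplified worst-case cross-entropy problem. It is an illustration under stated assumptions, not the CG-DRO procedure itself: the classifier is assumed to be a linear softmax model, the risk of each source is taken to be its plain empirical cross-entropy rather than a double machine learning estimate evaluated on target covariates, and the function names, step sizes, and iteration count are hypothetical choices. The mixture weights over sources are updated with an entropic (multiplicative) step on the simplex, and the classifier parameters with a Euclidean gradient step.

```python
import numpy as np


def softmax(z):
    # Row-wise softmax with the usual max-shift for numerical stability.
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def source_risks_and_grads(theta, sources):
    """Empirical cross-entropy and its gradient for each source.

    theta has shape (d, K); sources is a list of (X, y) pairs with
    X of shape (n, d) and integer labels y in {0, ..., K-1}.
    """
    risks, grads = [], []
    for X, y in sources:
        n = X.shape[0]
        P = softmax(X @ theta)
        risks.append(-np.log(P[np.arange(n), y] + 1e-12).mean())
        G = P.copy()
        G[np.arange(n), y] -= 1.0  # gradient of cross-entropy w.r.t. logits
        grads.append(X.T @ G / n)
    return np.array(risks), np.stack(grads)


def mirror_prox(sources, d, K, steps=500, eta_theta=0.5, eta_q=0.5):
    """Approximately solve min_theta max_{q in simplex} sum_l q_l * risk_l(theta)."""
    L = len(sources)
    theta = np.zeros((d, K))
    q = np.full(L, 1.0 / L)
    theta_avg = np.zeros_like(theta)
    q_avg = np.zeros(L)
    for _ in range(steps):
        # Extrapolation step: move from (theta, q) using gradients at (theta, q).
        risks, grads = source_risks_and_grads(theta, sources)
        theta_half = theta - eta_theta * np.tensordot(q, grads, axes=1)
        q_half = q * np.exp(eta_q * risks)
        q_half /= q_half.sum()
        # Update step: move from (theta, q) again, using gradients at the half point.
        risks_h, grads_h = source_risks_and_grads(theta_half, sources)
        theta = theta - eta_theta * np.tensordot(q_half, grads_h, axes=1)
        q = q * np.exp(eta_q * risks_h)
        q /= q.sum()
        # Mirror Prox guarantees are stated for the averaged half-step iterates.
        theta_avg += theta_half
        q_avg += q_half
    return theta_avg / steps, q_avg / steps
```

Calling `mirror_prox(sources, d, K)` with a list of `(X, y)` arrays returns the averaged classifier parameters together with the averaged adversarial weights over sources. Weights that concentrate on one source indicate that this source drives the worst-case risk in the simplified problem.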