How does unlabeled data improve generalization in self-training? A one-hidden-layer theoretical analysis

Self-training, a semi-supervised learning algorithm, leverages a large amount of unlabeled data to improve learning when the labeled data are limited. Despite empirical successes, its theoretical characterization remains elusive. To the best of our knowledge, this work establishes the first theoretical analysis for the known iterative self-training paradigm and proves the benefits of unlabeled data in both training convergence and generalization ability. To make our theoretical analysis feasible, we focus on the case of one-hidden-layer neural networks. However, theoretical understanding of iterative self-training is non-trivial even for a shallow neural network. One of the key challenges is that existing neural network landscape analysis built upon supervised learning no longer holds in the (semi-supervised) self-training paradigm. We address this challenge and prove that iterative self-training converges linearly with both convergence rate and generalization accuracy improved in the order of $1/\sqrt{M}$, where $M$ is the number of unlabeled samples. Experiments from shallow neural networks to deep neural networks are also provided to justify the correctness of our established theoretical insights on self-training.
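To make the iterative procedure concrete, the following is a minimal NumPy sketch of teacher-student self-training on a one-hidden-layer network. It is an illustration under stated assumptions, not the paper's exact algorithm: the ReLU activation, the fixed averaging output layer, the squared loss, the mixing weight `lam` between the labeled and pseudo-labeled losses, and all function names and hyperparameters (`rounds`, `steps`, `lr`) are choices made here for illustration and are not given in the abstract.

```python
import numpy as np


def forward(W, X):
    """One-hidden-layer ReLU network whose output layer averages the K hidden units.

    W has shape (K, d), X has shape (n, d); returns predictions of shape (n,).
    (Assumed architecture, chosen for illustration.)
    """
    return np.maximum(X @ W.T, 0.0).mean(axis=1)


def grad(W, X, y):
    """Gradient with respect to W of the loss 0.5 * mean((forward(W, X) - y)**2)."""
    pre = X @ W.T                                  # pre-activations, shape (n, K)
    resid = np.maximum(pre, 0.0).mean(axis=1) - y  # residuals, shape (n,)
    active = (pre > 0).astype(X.dtype)             # ReLU derivative, shape (n, K)
    K, n = W.shape[0], len(y)
    return (active * resid[:, None]).T @ X / (K * n)


def self_train(W0, X_lab, y_lab, X_unl, lam=0.5, rounds=10, steps=50, lr=0.5):
    """Iterative self-training (illustrative sketch).

    Each round, the current teacher pseudo-labels the unlabeled inputs; the
    student then runs gradient descent on a weighted sum of the labeled loss
    (weight lam) and the pseudo-labeled loss (weight 1 - lam), and finally
    replaces the teacher.
    """
    W_teacher = W0.copy()
    for _ in range(rounds):
        pseudo = forward(W_teacher, X_unl)         # pseudo-labels from the teacher
        W_student = W_teacher.copy()               # student starts at the teacher
        for _ in range(steps):
            g = lam * grad(W_student, X_lab, y_lab)
            g += (1.0 - lam) * grad(W_student, X_unl, pseudo)
            W_student -= lr * g
        W_teacher = W_student                      # student becomes the next teacher
    return W_teacher


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d, K, N, M = 5, 3, 20, 2000                    # N labeled, M unlabeled samples
    W_true = rng.normal(size=(K, d))               # synthetic ground-truth weights
    X_lab = rng.normal(size=(N, d))
    y_lab = forward(W_true, X_lab)
    X_unl = rng.normal(size=(M, d))
    W_init = W_true + 0.3 * rng.normal(size=(K, d))
    W_hat = self_train(W_init, X_lab, y_lab, X_unl)
    X_test = rng.normal(size=(1000, d))
    err = np.mean((forward(W_hat, X_test) - forward(W_true, X_test)) ** 2)
    print("test MSE after self-training:", err)
```

In this sketch the pseudo-labeled term has zero gradient at the start of each round, because the student is initialized at the teacher, so it acts as a pull toward the teacher's predictions on the $M$ unlabeled inputs while the labeled term corrects the model. Rerunning the demo with different values of `M` is one informal way to explore how the amount of unlabeled data affects the final test error.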
