Why the pseudo label based semi-supervised learning algorithm is effective?

Recently, pseudo-label-based semi-supervised learning has achieved great success in many fields. The core idea of the algorithm is to use a model trained on the labeled data to generate pseudo labels for the unlabeled data, and then to train a model to fit those pseudo labels. In this paper, we give a theoretical analysis of why pseudo-label-based semi-supervised learning is effective. We mainly compare the generalization error of models trained under two settings: (1) there are N labeled data; (2) there are N unlabeled data and a suitable initial model. Our analysis shows, first, that when the amount of unlabeled data tends to infinity, the pseudo-label-based semi-supervised learning algorithm can obtain a model with the same generalization error upper bound as a model obtained by ordinary training when the amount of labeled data tends to infinity. More importantly, we prove that when the amount of unlabeled data is large enough, the generalization error upper bound of the model obtained by the pseudo-label-based semi-supervised learning algorithm converges to the optimal upper bound at a linear rate. We also give a lower bound on the sample complexity required to achieve this linear convergence rate. Our analysis contributes to understanding the empirical successes of pseudo-label-based semi-supervised learning.
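The three-step procedure described in the abstract (train an initial model on labeled data, generate pseudo labels on unlabeled data, fit a new model to those pseudo labels) can be illustrated with a minimal sketch. This is not the paper's construction or its experimental setup: the nearest-centroid classifier, the two-Gaussian toy data, the sample sizes, and every function name (`sample`, `fit_centroids`, `predict`, `accuracy`, `pseudo_label_train`) are assumptions chosen only to keep the example self-contained.

```python
import random

def sample(n, rng):
    """Draw n points from two unit-variance Gaussian blobs centred at (-2, 0) and (2, 0)."""
    points, labels = [], []
    for i in range(n):
        c = i % 2  # alternate classes so both always appear (toy assumption)
        cx = -2.0 if c == 0 else 2.0
        points.append((rng.gauss(cx, 1.0), rng.gauss(0.0, 1.0)))
        labels.append(c)
    return points, labels

def fit_centroids(points, labels):
    """Nearest-centroid 'model': the mean point of each class."""
    sums, counts = {}, {}
    for (x, y), c in zip(points, labels):
        sx, sy = sums.get(c, (0.0, 0.0))
        sums[c] = (sx + x, sy + y)
        counts[c] = counts.get(c, 0) + 1
    return {c: (sums[c][0] / counts[c], sums[c][1] / counts[c]) for c in sums}

def predict(centroids, points):
    """Assign each point the class of its closest centroid."""
    def nearest(p):
        return min(
            centroids,
            key=lambda c: (p[0] - centroids[c][0]) ** 2 + (p[1] - centroids[c][1]) ** 2,
        )
    return [nearest(p) for p in points]

def accuracy(centroids, points, labels):
    """Fraction of points whose predicted class matches the true label."""
    preds = predict(centroids, points)
    return sum(p == t for p, t in zip(preds, labels)) / len(labels)

def pseudo_label_train(labeled_x, labeled_y, unlabeled_x):
    initial = fit_centroids(labeled_x, labeled_y)   # step 1: initial model from labeled data
    pseudo_y = predict(initial, unlabeled_x)        # step 2: pseudo labels for unlabeled data
    student = fit_centroids(unlabeled_x, pseudo_y)  # step 3: new model fitted to pseudo labels
    return initial, student

rng = random.Random(0)
labeled_x, labeled_y = sample(10, rng)   # small labeled set
unlabeled_x, _ = sample(2000, rng)       # true labels are discarded
test_x, test_y = sample(2000, rng)       # held-out data for evaluation

initial, student = pseudo_label_train(labeled_x, labeled_y, unlabeled_x)
print("initial model accuracy:     ", accuracy(initial, test_x, test_y))
print("pseudo-label model accuracy:", accuracy(student, test_x, test_y))
```

In this sketch the second model never sees a true label for the unlabeled points; it depends on the labeled data only through the initial model, which corresponds to setting (2) in the abstract. The toy example says nothing about the convergence rates or bounds analyzed in the paper.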
