How many labelers do you have? A closer look at gold-standard labels

Most supervised learning datasets are constructed by collecting multiple labels for each instance and then aggregating them into a single "gold-standard" label. We question the wisdom of this pipeline by developing a (stylized) theoretical model of the process and analyzing its statistical consequences. We show that access to non-aggregated label information can make training well-calibrated models easier or, in some cases, feasible at all, whereas it is impossible with only gold-standard labels. The full story is subtle, however: the contrast between aggregated and fuller label information depends on the particulars of the problem. Estimators that use aggregated information exhibit robust but slower rates of convergence, while estimators that can effectively leverage all labels converge more quickly if they have fidelity to (or can learn) the true labeling process. The theory we develop in the stylized model makes several predictions for real-world datasets, including when non-aggregate labels should improve learning performance. We test these predictions to corroborate the theory's validity.
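The calibration point can be illustrated with a toy calculation. This is a minimal sketch under an assumption that is ours and not necessarily the paper's model: each of m labelers independently reports label 1 with the same probability p for a given instance. The function name `majority_vote_prob` is hypothetical. Under this assumption, the expected fraction of positive votes is exactly p, while the probability that the majority-vote label equals 1 drifts toward 0 or 1 as m grows, so a model fit only to the aggregated label would target a different quantity than p.

```python
from math import comb


def majority_vote_prob(p: float, m: int) -> float:
    """Probability that a majority of m labelers report 1.

    Toy assumption (not taken from the paper): labelers are independent
    and each reports 1 with probability p. m must be odd to avoid ties.
    """
    if m <= 0 or m % 2 == 0:
        raise ValueError("m must be a positive odd integer")
    # Sum the binomial probabilities of seeing more than m/2 positive labels.
    return sum(
        comb(m, k) * p**k * (1 - p) ** (m - k)
        for k in range(m // 2 + 1, m + 1)
    )


if __name__ == "__main__":
    p = 0.7  # hypothetical per-labeler probability of reporting 1
    for m in (1, 3, 11, 51):
        # The mean of the raw votes has expectation p for every m,
        # whereas the majority-vote label is 1 with this probability:
        print(m, majority_vote_prob(p, m))
```

With a single labeler the two quantities coincide; with more labelers the aggregated label becomes less noisy but carries less information about p itself, which matches the robustness-versus-information trade-off the abstract describes.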
