Future superhuman models will surpass human abilities, and humans will only be able to \textit{weakly} supervise them. To alleviate the lack of high-quality data for aligning such models, work on weak-to-strong generalization (W2SG) finetunes a strong pretrained model under a weak supervisor so that it generalizes beyond the weak supervision. However, existing methods use weak supervision indiscriminately, which exposes robustness issues: a proportion of weak labels proves harmful to the strong model. In this paper, we propose a selective W2SG framework that avoids using weak supervision when it is unnecessary. We train a binary classifier, P(IK), to identify questions that the strong model can already answer and use the model's self-generated labels on those questions for alignment. We further refine the remaining weak labels with a graph smoothing method. Extensive experiments on three benchmarks show that our method consistently outperforms competitive baselines. Further analyses show that P(IK) generalizes across tasks and difficulties, suggesting that selective W2SG can help superalignment.
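To make the labeling rule concrete, the sketch below (illustrative Python, not taken from the paper) shows one plausible form of the pipeline the abstract describes: weak labels are smoothed over a kNN similarity graph built from question embeddings, and each question then receives either the strong model's self-generated label (when its P(IK) score clears a threshold \texttt{tau}) or the smoothed weak label. The function names, the graph construction, and the threshold are assumptions made for illustration, not the paper's exact recipe.

\begin{verbatim}
import numpy as np

def smooth_weak_labels(emb, weak_probs, alpha=0.8, n_iter=10, k=10):
    """Refine noisy weak labels by propagating them over a kNN cosine-
    similarity graph of question embeddings (one plausible instance of
    graph smoothing; illustrative only)."""
    k = min(k, len(emb) - 1)
    norm = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sim = norm @ norm.T
    np.fill_diagonal(sim, 0.0)
    # keep only the k strongest edges per node, then symmetrize
    adj = np.zeros_like(sim)
    idx = np.argsort(-sim, axis=1)[:, :k]
    rows = np.repeat(np.arange(len(emb)), k)
    adj[rows, idx.ravel()] = sim[rows, idx.ravel()]
    adj = np.maximum(adj, adj.T)
    trans = adj / adj.sum(axis=1, keepdims=True).clip(min=1e-12)
    probs = weak_probs.copy()
    for _ in range(n_iter):
        # blend neighborhood average with the original weak labels
        probs = alpha * (trans @ probs) + (1 - alpha) * weak_probs
    return probs

def select_labels(pik_scores, self_labels, smoothed_weak_labels, tau=0.5):
    """Selective labeling: use the strong model's own answer when P(IK)
    indicates it knows; otherwise fall back to the smoothed weak label."""
    use_self = pik_scores >= tau
    return np.where(use_self, self_labels, smoothed_weak_labels), use_self
\end{verbatim}

The selected labels would then replace the raw weak labels in the usual W2SG finetuning loop; only the selection and smoothing steps are sketched here.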