We study how training data contributes to the emergence of toxic behaviors in large language models. Most prior work on reducing model toxicity adopts reactive approaches, such as fine-tuning pre-trained (and potentially toxic) models to align them with human values. In contrast, we propose a proactive approach, IF-GUIDE, that leverages influence functions to identify and suppress harmful tokens in the training data. To this end, we first show that standard influence functions are ineffective at discovering harmful training records. We then present a novel adaptation that measures token-level attributions from training data to model toxicity, along with techniques for selecting toxic training documents and a learning objective that can be integrated into both pre-training and fine-tuning. Moreover, IF-GUIDE does not rely on human-preference data, which is typically required by existing alignment methods. In our evaluation, we demonstrate that IF-GUIDE substantially reduces both explicit and implicit toxicity, by up to 10$\times$ compared to uncensored models and up to 3$\times$ compared to baseline alignment methods such as DPO and RAD, across both pre-training and fine-tuning scenarios. IF-GUIDE is computationally efficient: a billion-parameter model is not necessary for computing influence scores; a million-parameter model, with 7.5$\times$ fewer parameters, can effectively serve as a proxy for identifying harmful data. Our code is publicly available at: https://github.com/ztcoalson/IF-Guide
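To make the idea concrete, the following is a minimal, illustrative sketch of token-level influence scoring with a small proxy model. It is not the released implementation: it approximates influence with first-order gradient dot products between a toxicity "query" loss and each training token's loss (omitting the Hessian term of full influence functions), and the proxy checkpoint, reference texts, and scoring loop are assumptions chosen for illustration; see the repository above for the actual method.

```python
# Illustrative sketch only (NOT the IF-GUIDE implementation): score each training
# token by how strongly its loss gradient aligns with the gradient of a toxicity
# "query" loss, computed on a small proxy language model. Model name, reference
# texts, and the first-order approximation are assumptions for this example.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
proxy_name = "EleutherAI/pythia-160m"  # assumed small proxy; any causal LM works
model = AutoModelForCausalLM.from_pretrained(proxy_name).to(device)
model.eval()  # disable dropout so gradients are deterministic
tokenizer = AutoTokenizer.from_pretrained(proxy_name)
params = [p for p in model.parameters() if p.requires_grad]


def flat_grad(loss, retain_graph=False):
    """Flatten d(loss)/d(params) into a single vector."""
    grads = torch.autograd.grad(loss, params, retain_graph=retain_graph)
    return torch.cat([g.reshape(-1) for g in grads])


def query_gradient(toxic_refs):
    """Gradient of the mean LM loss over toxic reference texts (the 'toxicity query')."""
    losses = []
    for text in toxic_refs:
        ids = tokenizer(text, return_tensors="pt").input_ids.to(device)
        losses.append(model(ids, labels=ids).loss)
    return flat_grad(torch.stack(losses).mean())


def token_influence_scores(train_text, q_grad):
    """Score each token of a training document; large positive scores flag
    candidate harmful tokens to suppress during (pre-)training."""
    ids = tokenizer(train_text, return_tensors="pt").input_ids.to(device)
    logits = model(ids).logits[0, :-1]        # predict token t from the prefix < t
    targets = ids[0, 1:]
    per_token_loss = F.cross_entropy(logits, targets, reduction="none")
    scores = []
    for t in range(per_token_loss.numel()):
        g_t = flat_grad(per_token_loss[t], retain_graph=True)
        scores.append(torch.dot(q_grad, g_t).item())
    return list(zip(tokenizer.convert_ids_to_tokens(targets.tolist()), scores))


# Example usage (placeholder strings; a real pipeline would use curated data):
# q = query_gradient(["<toxic reference text>"])
# print(token_influence_scores("<candidate training document>", q))
```

In this sketch, full parameter-space gradients are materialized per token, which is only feasible for a small proxy; practical influence pipelines typically rely on low-rank or Kronecker-factored Hessian approximations to scale.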