UnsafeChain: Enhancing Reasoning Model Safety via Hard Cases

As large reasoning models (LRMs) grow more capable, chain-of-thought (CoT) reasoning introduces new safety challenges. Existing SFT-based safety alignment studies have predominantly focused on filtering prompts with safe, high-quality responses, while overlooking hard prompts that always elicit harmful outputs. To fill this gap, we introduce UnsafeChain, a safety alignment dataset constructed from hard prompts drawn from diverse sources, in which unsafe completions are identified and explicitly corrected into safe responses. By exposing models to unsafe behaviors and guiding their correction, UnsafeChain enhances safety while preserving general reasoning ability. We fine-tune three LRMs on UnsafeChain and compare them against the recent SafeChain and STAR-1 datasets across six out-of-distribution and five in-distribution benchmarks. UnsafeChain consistently outperforms prior datasets, and even a 1K subset matches or surpasses baseline performance, demonstrating the effectiveness and generalizability of correction-based supervision. We release our dataset and code at https://github.com/mbzuai-nlp/UnsafeChain
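The correction-based construction described in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: the callables `generate`, `is_unsafe`, and `correct` are hypothetical placeholders for a model under study, a safety judge, and a rewriting step, and the sketch samples each prompt only once, whereas the abstract speaks of prompts that always elicit harmful outputs.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Example:
    """One supervised fine-tuning pair: a hard prompt and its corrected response."""
    prompt: str
    response: str


def build_correction_dataset(
    prompts: Iterable[str],
    generate: Callable[[str], str],          # hypothetical: model being probed
    is_unsafe: Callable[[str, str], bool],   # hypothetical: safety judge
    correct: Callable[[str, str], str],      # hypothetical: rewrites unsafe output
) -> List[Example]:
    """Keep prompts whose completion is judged unsafe and pair each
    with a corrected, safe response (assumed single-sample check)."""
    dataset: List[Example] = []
    for prompt in prompts:
        response = generate(prompt)
        if not is_unsafe(prompt, response):
            continue  # prompt is not "hard" under this single-sample check
        dataset.append(Example(prompt, correct(prompt, response)))
    return dataset
```

In this sketch, prompts that already yield safe completions are dropped, so the resulting pairs consist only of hard prompts and their corrected responses.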

