Large Reasoning Models (LRMs) improve answer quality through explicit chain-of-thought reasoning, yet this very capability introduces new safety risks: harmful content can be subtly injected, surface gradually, or be justified by misleading rationales within the reasoning trace. Existing safety evaluations, however, focus primarily on output-level judgments and rarely capture these dynamic risks that emerge along the reasoning process. In this paper, we present SafeRBench, the first benchmark that assesses LRM safety end-to-end -- from inputs and intermediate reasoning to final outputs. (1) Input Characterization: We are the first to incorporate risk categories and risk levels into input design, explicitly accounting for affected groups and severity, and thereby construct a balanced prompt suite reflecting diverse harm gradients. (2) Fine-Grained Output Analysis: We introduce a micro-thought chunking mechanism that segments long reasoning traces into semantically coherent units, enabling fine-grained evaluation across ten safety dimensions. (3) Human Safety Alignment: We validate LLM-based evaluations against human annotations specifically designed to capture safety judgments. Evaluations of 19 LRMs demonstrate that SafeRBench enables detailed, multidimensional safety assessment, offering insights into risks and protective mechanisms from multiple perspectives.
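To make the chunking idea concrete, the following is a minimal illustrative sketch of segmenting a long reasoning trace into semantically coherent units; it is not the paper's actual micro-thought chunking mechanism, and the embedding model (`all-MiniLM-L6-v2`), the sentence-splitting regex, and the cosine-similarity merge threshold are all assumptions made purely for illustration.

```python
"""Sketch: split a reasoning trace into semantically coherent chunks
("micro-thoughts") by sentence segmentation plus similarity-based merging.
Assumes sentence-transformers and scikit-learn; not the paper's method."""

import re

from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical model choice; any sentence-embedding model would work here.
_MODEL = SentenceTransformer("all-MiniLM-L6-v2")


def chunk_reasoning_trace(trace: str, sim_threshold: float = 0.6) -> list[str]:
    """Split the trace into sentences, then merge adjacent sentences whose
    embeddings are similar enough, yielding semantically coherent chunks."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", trace) if s.strip()]
    if not sentences:
        return []

    embeddings = _MODEL.encode(sentences)
    chunks = [[sentences[0]]]
    for i in range(1, len(sentences)):
        sim = cosine_similarity([embeddings[i - 1]], [embeddings[i]])[0][0]
        if sim >= sim_threshold:
            chunks[-1].append(sentences[i])  # continue the current micro-thought
        else:
            chunks.append([sentences[i]])    # start a new micro-thought
    return [" ".join(c) for c in chunks]
```

In such a setup, each resulting chunk could then be scored independently (e.g., by an LLM judge) along multiple safety dimensions, rather than judging only the final answer.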