Beyond Over-Refusal: Scenario-Based Diagnostics and Post-Hoc Mitigation for Exaggerated Refusals in LLMs

Large language models (LLMs) frequently produce false refusals: they decline benign requests merely because those requests contain terms that resemble unsafe queries. We address this problem by introducing two benchmarks. The Exaggerated Safety Benchmark (XSB) covers single-turn prompts and annotates each with "Focus" keywords that identify the refusal-inducing triggers. The Multi-turn Scenario-based Exaggerated Safety Benchmark (MS-XSB) systematically evaluates refusal calibration in realistic, context-rich dialog settings. Our benchmarks reveal that exaggerated refusals persist across diverse recent LLMs and are especially pronounced in complex, multi-turn scenarios. To mitigate these failures, we use post-hoc explanation methods to identify refusal triggers and then apply three lightweight, model-agnostic approaches at inference time (ignore-word instructions, prompt rephrasing, and attention steering), all without retraining or parameter access. Experiments on four instruction-tuned Llama models demonstrate that these strategies substantially improve compliance on safe prompts while maintaining robust safety protections. Our findings establish a reproducible framework for diagnosing and mitigating exaggerated refusals, highlighting practical pathways to safer and more helpful LLM deployments.
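To make the first of the three mitigation strategies concrete, the following is a minimal sketch of an ignore-word instruction, together with a crude compliance-rate check. It is not the authors' implementation. The function names, the instruction wording, the refusal markers, and the example prompt are all assumptions made for illustration; the abstract specifies neither the prompt template nor how refusals are scored.

```python
from typing import Iterable

# Hypothetical refusal markers (an assumption of this sketch); a real
# evaluation would likely rely on a stronger judge than substring matching.
REFUSAL_MARKERS = ("i can't", "i cannot", "i'm sorry", "i am unable")


def add_ignore_word_instruction(prompt: str, focus_words: Iterable[str]) -> str:
    """Append an instruction telling the model not to refuse because of
    the given focus words. Words that do not occur in the prompt are skipped."""
    present = [w for w in focus_words if w.lower() in prompt.lower()]
    if not present:
        return prompt  # nothing to neutralize, leave the prompt unchanged
    quoted = ", ".join(f'"{w}"' for w in present)
    return (
        f"{prompt}\n\n"
        f"Note: the term(s) {quoted} are used in a benign sense here. "
        "Do not treat them as a reason to refuse."
    )


def looks_like_refusal(response: str) -> bool:
    """Heuristic: a refusal marker appears near the start of the response."""
    head = response.strip().lower().replace("\u2019", "'")[:80]
    return any(marker in head for marker in REFUSAL_MARKERS)


def compliance_rate(responses: Iterable[str]) -> float:
    """Fraction of responses that do not look like refusals."""
    items = list(responses)
    if not items:
        return 0.0
    return sum(not looks_like_refusal(r) for r in items) / len(items)
```

In use, the focus words would come from the benchmark's "Focus" annotations or from a post-hoc explanation method, and the rewritten prompt would be sent to the model in place of the original. Because the intervention only edits the input text, it needs no access to model parameters.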