With the widespread application of Large Language Models (LLMs), their associated safety issues have become increasingly prominent, severely constraining their trustworthy deployment in critical domains. This paper proposes a novel safety response framework designed to systematically safeguard LLMs at both the input and output levels. At the input level, the framework employs a safety classification model built by supervised fine-tuning. Using a fine-grained four-tier taxonomy (Safe, Unsafe, Conditionally Safe, Focused Attention), the model performs precise risk identification and differentiated handling of user queries, significantly improving risk coverage and adaptability to business scenarios while achieving a risk recall rate of 99.3%. At the output level, the framework integrates Retrieval-Augmented Generation (RAG) with a specifically fine-tuned interpretation model, ensuring that every response is grounded in a real-time, trustworthy knowledge base; this grounding eliminates information fabrication and makes results traceable. Experimental results show that the proposed safety control model achieves a significantly higher safety score than the baseline model, TinyR1-Safety-8B, on public safety evaluation benchmarks. Furthermore, on our proprietary high-risk test set, the framework's components attained a 100% safety score, validating their protective capability in complex risk scenarios. This research provides an effective engineering pathway for building high-security, high-trust LLM applications.
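To make the dispatch logic concrete, the following is a minimal sketch of the two-level pipeline the abstract describes, assuming a four-tier input classifier routing queries to a RAG-grounded generator. All names here (RiskTier, classify_query, retrieve, generate) and the toy keyword rule are illustrative placeholders, not the paper's actual models or interfaces.

```python
"""Hypothetical sketch of the two-level safety pipeline: a four-tier input
classifier followed by knowledge-base-grounded answer generation. Every
identifier is an illustrative assumption, not the paper's implementation."""
from enum import Enum


class RiskTier(Enum):
    SAFE = "safe"
    UNSAFE = "unsafe"
    CONDITIONALLY_SAFE = "conditionally_safe"
    FOCUSED_ATTENTION = "focused_attention"


def classify_query(query: str) -> RiskTier:
    # Stand-in for the supervised fine-tuned safety classifier; a toy
    # keyword rule substitutes for the actual model call.
    return RiskTier.UNSAFE if "exploit" in query.lower() else RiskTier.SAFE


def retrieve(query: str) -> list[str]:
    # Stand-in retriever over a real-time, trusted knowledge base.
    return [f"[trusted passage relevant to: {query}]"]


def generate(query: str, evidence: list[str]) -> str:
    # Stand-in for the fine-tuned interpretation model; answers reference
    # the retrieved evidence so results remain traceable.
    return f"Answer to '{query}', grounded in: {'; '.join(evidence)}"


def safe_respond(query: str) -> str:
    """Differentiated handling by risk tier, per the framework description."""
    tier = classify_query(query)
    if tier is RiskTier.UNSAFE:
        return "This request cannot be assisted with."  # hard refusal
    # SAFE, CONDITIONALLY_SAFE, and FOCUSED_ATTENTION queries are all
    # answered from the knowledge base; the latter two tiers would carry
    # extra constraints or review in a real deployment.
    return generate(query, retrieve(query))


if __name__ == "__main__":
    print(safe_respond("How do I reset my router?"))
```

In an actual deployment, classify_query would invoke the fine-tuned classifier, and Conditionally Safe or Focused Attention queries could trigger additional policy constraints or audit logging before the grounded answer is returned.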