A Proprietary Model-Based Safety Response Framework for AI Agents

As Large Language Models (LLMs) are applied more widely, their security issues have become increasingly prominent and severely constrain their trustworthy deployment in critical domains. This paper proposes a safety response framework that safeguards LLMs at both the input and the output level. At the input level, the framework uses a safety classification model trained with supervised fine-tuning. Through a fine-grained four-tier taxonomy (Safe, Unsafe, Conditionally Safe, Focused Attention), it identifies risk in user queries and handles each tier differently, which improves risk coverage and adaptability to business scenarios and achieves a risk recall rate of 99.3%. At the output level, the framework integrates Retrieval-Augmented Generation (RAG) with a specifically fine-tuned interpretation model, so that all responses are grounded in a real-time, trustworthy knowledge base. This approach eliminates information fabrication and makes results traceable. Experimental results show that our proposed safety control model achieves a significantly higher safety score on public safety evaluation benchmarks than the baseline model, TinyR1-Safety-8B. Furthermore, on our proprietary high-risk test set, the framework's components attained a 100% safety score, validating their protective capabilities in complex risk scenarios. This research provides an effective engineering pathway for building high-security, high-trust LLM applications.
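The two-level design described above can be pictured with a minimal Python sketch. It is an illustration only, not the authors' implementation: the abstract names the four tiers but does not say how each tier is handled, so the routing policy here (refuse Unsafe, attach a caution to Conditionally Safe, flag Focused Attention for review) is an assumption, as are all names (`Tier`, `Passage`, `Response`, `respond`) and the `classify`, `retrieve`, and `generate` callables, which stand in for the fine-tuned classifier, the knowledge-base retriever, and the interpretation model.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, List


class Tier(Enum):
    """The four tiers named in the abstract."""
    SAFE = "safe"
    UNSAFE = "unsafe"
    CONDITIONALLY_SAFE = "conditionally_safe"
    FOCUSED_ATTENTION = "focused_attention"


@dataclass
class Passage:
    source_id: str  # identifier of the knowledge-base entry (hypothetical)
    text: str


@dataclass
class Response:
    text: str
    sources: List[str] = field(default_factory=list)  # enables traceability
    needs_review: bool = False


REFUSAL = "This request cannot be answered."
NO_SOURCE = "No trusted source is available for this question."
CAUTION = " (Answer limited to the cited sources.)"


def respond(
    query: str,
    classify: Callable[[str], Tier],
    retrieve: Callable[[str], List[Passage]],
    generate: Callable[[str, List[Passage]], str],
) -> Response:
    """Route a query by tier, then answer only from retrieved passages."""
    tier = classify(query)

    # Input level: assumed policy, Unsafe queries are refused outright.
    if tier is Tier.UNSAFE:
        return Response(REFUSAL)

    # Output level: every answer must be grounded in retrieved passages.
    passages = retrieve(query)
    if not passages:
        return Response(NO_SOURCE)

    text = generate(query, passages)
    if tier is Tier.CONDITIONALLY_SAFE:
        text += CAUTION  # assumed handling for this tier

    return Response(
        text=text,
        sources=[p.source_id for p in passages],
        needs_review=(tier is Tier.FOCUSED_ATTENTION),  # assumed handling
    )
```

Passing the three models in as callables keeps the routing logic testable with stubs. Returning `sources` alongside the text is one simple way to make each answer traceable to the knowledge-base entries it was generated from.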

