Model Safety Papers - Zhuanzhi (专知)

Model Safety

SafeRBench: A Comprehensive Benchmark for Safety Assessment in Large Reasoning Models

arXiv

0+ reads · November 19

ProSocialAlign: Preference Conditioned Test Time Alignment in Language Models

arXiv

0+ reads · December 6

Beyond Fixed and Dynamic Prompts: Embedded Jailbreak Templates for Advancing LLM Security

arXiv

0+ reads · November 18

Differentiated Directional Intervention: A Framework for Evading LLM Safety Alignment

arXiv

0+ reads · November 24

Differentiated Directional Intervention: A Framework for Evading LLM Safety Alignment

arXiv

0+ reads · November 17

Differentiated Directional Intervention: A Framework for Evading LLM Safety Alignment

arXiv

0+ reads · November 11

Differentiated Directional Intervention: A Framework for Evading LLM Safety Alignment

arXiv

0+ reads · November 10

Chasing Shadows: Pitfalls in LLM Security Research

arXiv

0+ reads · December 10

Chasing Shadows: Pitfalls in LLM Security Research

arXiv

0+ reads · December 15

MCP-SafetyBench: A Benchmark for Safety Evaluation of Large Language Models with Real-World MCP Servers

arXiv

0+ reads · December 17

Beyond SFT: Reinforcement Learning for Safer Large Reasoning Models with Better Reasoning Ability

arXiv

0+ reads · December 1

Efficient LLM Safety Evaluation through Multi-Agent Debate

arXiv

0+ reads · November 9

Towards Resilient Safety-driven Unlearning for Diffusion Models against Downstream Fine-tuning

arXiv

0+ reads · December 6

ORFuzz: Fuzzing the "Other Side" of LLM Safety -- Testing Over-Refusal

arXiv

0+ reads · December 5
