Large Language Models (LLMs) are trained to refuse to respond to harmful content. However, systematic analyses of whether this refusal behavior truly reflects their safety policies or instead constitutes political censorship of the kind practiced by governments worldwide are lacking. Differentiating between safety-motivated refusals and politically motivated censorship is difficult. To this end, we introduce PSP, a dataset built specifically to probe the refusal behavior of LLMs in an explicitly political context. PSP is constructed by reformatting existing censored content from two openly available internet data sources: prompts considered sensitive in China, generalized to multiple countries, and tweets that have been censored in various countries. We study: 1) the impact of political sensitivity on seven LLMs through data-driven (making PSP prompts implicit) and representation-level (erasing the concept of politics) approaches; and 2) the vulnerability of models on PSP to prompt injection attacks (PIAs). Associating censorship with refusals of content whose implicit intent is masked, we find that most LLMs perform some form of censorship. We conclude by summarizing the major attributes that can shift refusal distributions across models and across the contexts of different countries.