Large language models, trained on massive corpora, are prone to verbatim memorization of training data, creating significant privacy and copyright risks. While previous work has proposed various definitions of memorization, many fall short of comprehensively capturing this phenomenon, especially in aligned models. To address this, we introduce a novel framework: multi-prefix memorization. Our core insight is that memorized sequences are deeply encoded and thus retrievable via a significantly larger number of distinct prefixes than non-memorized content. We formalize this by defining a sequence as memorized if an external adversarial search can identify a target number of distinct prefixes that each elicit it. This framework shifts the focus from single-path extraction to quantifying the robustness of a memorized sequence, measured by the diversity of its retrieval paths. Through experiments on open-source and aligned chat models, we demonstrate that our multi-prefix definition reliably distinguishes memorized from non-memorized data, providing a robust and practical tool for auditing data leakage in LLMs.
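To make the definition concrete, the following is a minimal sketch of a multi-prefix memorization check: a target sequence is flagged as memorized once a threshold number of distinct prefixes each elicit it under greedy decoding. The model name, the threshold `k`, the candidate prefix pool, and greedy decoding as the elicitation criterion are illustrative assumptions, not the paper's actual adversarial search procedure.

```python
# Minimal sketch of a multi-prefix memorization check (illustrative assumptions only).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer


def elicits(model, tokenizer, prefix: str, target: str) -> bool:
    """Return True if greedy decoding from `prefix` reproduces `target` verbatim."""
    inputs = tokenizer(prefix, return_tensors="pt")
    target_ids = tokenizer(target, add_special_tokens=False).input_ids
    with torch.no_grad():
        out = model.generate(
            **inputs,
            max_new_tokens=len(target_ids),
            do_sample=False,  # greedy decoding as the elicitation criterion (assumption)
        )
    continuation = out[0, inputs.input_ids.shape[1]:].tolist()
    return continuation == target_ids


def is_multi_prefix_memorized(model, tokenizer, target: str,
                              candidate_prefixes, k: int = 5) -> bool:
    """Declare `target` memorized once k distinct prefixes each elicit it."""
    hits = 0
    for prefix in candidate_prefixes:
        if elicits(model, tokenizer, prefix, target):
            hits += 1
            if hits >= k:
                return True
    return False


if __name__ == "__main__":
    name = "gpt2"  # placeholder open-source model (assumption)
    tok = AutoTokenizer.from_pretrained(name)
    lm = AutoModelForCausalLM.from_pretrained(name)
    # In practice the prefix pool would come from an adversarial search, not a fixed list.
    prefixes = ["The quick brown", "Once upon a time, the quick brown", "A quick brown"]
    print(is_multi_prefix_memorized(lm, tok, " fox jumps over the lazy dog", prefixes, k=2))
```

In this sketch, the threshold `k` plays the role of the "target count" in the definition above: raising it demands more retrieval paths before a sequence is declared memorized, trading recall for robustness.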