Guess or Recall? Training CNNs to Classify and Localize Memorization in LLMs

Verbatim memorization in Large Language Models (LLMs) is a multifaceted phenomenon involving distinct underlying mechanisms. We introduce a novel method to analyze the different forms of memorization described by the existing taxonomy. Specifically, we train Convolutional Neural Networks (CNNs) on the attention weights of the LLM and evaluate the alignment between this taxonomy and the attention weights involved in decoding. We find that the existing taxonomy performs poorly and fails to reflect distinct mechanisms within the attention blocks. We propose a new taxonomy that maximizes alignment with the attention weights, consisting of three categories: memorized samples that are guessed using language modeling abilities, memorized samples that are recalled due to high duplication in the training set, and non-memorized samples. Our results reveal that few-shot verbatim memorization does not correspond to a distinct attention mechanism. We also show that a significant proportion of extractable samples are in fact guessed by the model and should therefore be studied separately. Finally, we develop a custom visual interpretability technique to localize the regions of the attention weights involved in each form of memorization.
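
To make the setup more concrete, the following is a minimal sketch of how a stack of attention maps could be fed to a small convolutional classifier with three output classes. It is not the authors' architecture: the sizes (`n_heads`, `seq_len`), the single convolutional layer, the synthetic attention maps, and the untrained random weights are all assumptions chosen for illustration, and the sketch shows only a forward pass, not training.

```python
import math
import random

# The three categories named in the abstract.
CLASSES = ["guessed", "recalled", "non-memorized"]


def synthetic_attention(seq_len, rng):
    """Return one causal attention map: row i is a softmax over positions 0..i."""
    rows = []
    for i in range(seq_len):
        scores = [rng.gauss(0.0, 1.0) for _ in range(i + 1)]
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        # Future positions get zero weight (causal mask).
        rows.append([e / total for e in exps] + [0.0] * (seq_len - i - 1))
    return rows


def conv2d_relu(channels, kernels, bias):
    """Valid 2D convolution summed over input channels, followed by ReLU."""
    k = len(kernels[0])
    height, width = len(channels[0]), len(channels[0][0])
    out = []
    for y in range(height - k + 1):
        row = []
        for x in range(width - k + 1):
            acc = bias
            for fmap, kern in zip(channels, kernels):
                for dy in range(k):
                    for dx in range(k):
                        acc += fmap[y + dy][x + dx] * kern[dy][dx]
            row.append(max(0.0, acc))
        out.append(row)
    return out


def global_avg_pool(fmap):
    """Average all entries of a 2D feature map into one scalar."""
    return sum(sum(row) for row in fmap) / (len(fmap) * len(fmap[0]))


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(attn_stack, params):
    """Forward pass: conv filters -> ReLU -> global average pooling -> linear -> softmax."""
    features = [
        global_avg_pool(conv2d_relu(attn_stack, f["kernels"], f["bias"]))
        for f in params["filters"]
    ]
    logits = [
        sum(w * feat for w, feat in zip(weights, features)) + b
        for weights, b in zip(params["linear_w"], params["linear_b"])
    ]
    return softmax(logits)


def random_params(n_channels, n_filters, k, rng):
    """Untrained placeholder weights; a real classifier would learn these from labeled samples."""
    return {
        "filters": [
            {
                "kernels": [
                    [[rng.gauss(0.0, 0.5) for _ in range(k)] for _ in range(k)]
                    for _ in range(n_channels)
                ],
                "bias": 0.0,
            }
            for _ in range(n_filters)
        ],
        "linear_w": [[rng.gauss(0.0, 0.5) for _ in range(n_filters)] for _ in CLASSES],
        "linear_b": [0.0] * len(CLASSES),
    }


rng = random.Random(0)
n_heads, seq_len = 4, 8  # hypothetical sizes; one input channel per attention head
attn_stack = [synthetic_attention(seq_len, rng) for _ in range(n_heads)]
params = random_params(n_heads, n_filters=3, k=3, rng=rng)
probs = classify(attn_stack, params)
predicted = CLASSES[probs.index(max(probs))]
```

Treating each attention head as an image channel is one plausible way to present attention weights to a CNN; with real data, the synthetic maps would be replaced by attention weights recorded while the LLM decodes a sample, and the weights would be fit against the taxonomy labels.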
