The task of reconstructing unknown textual inputs to language models is a fundamental auditing primitive that allows us to assess a model's vulnerability to a range of security issues, including stealing hidden system prompts, detecting backdoors, and leaking private data. Existing inversion works assume access to differing levels of information (e.g., input-output examples, model parameters, intermediate activations, or output logits) but often fail to fully reconstruct the desired input. In this paper, we present the Sparse One-hot Discrete Adam (SODA) algorithm, a search-based inversion method that accurately reconstructs the input text given white-box access to the language model and its output. Our experiments demonstrate for the first time that exact language model inversion is possible on both natural language and random inputs: SODA achieves reconstruction rates of 98% and 79%, respectively, on inputs of up to 10 tokens. Furthermore, we show that input length and vocabulary size have a far greater impact on the probability of a successful reconstruction than the size of the language model itself, allowing us to scale to models ranging from 33M to 3B parameters.
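The abstract does not detail SODA's procedure, so the following is only a minimal sketch of the general setting it describes: gradient-based inversion that relaxes one-hot token vectors into continuous variables and optimizes them with Adam to match an observed model output. Everything here is an assumption for illustration; the `ToyLM` model, the `forward_soft` interface, and all hyperparameters are hypothetical, and the actual SODA algorithm (in particular its sparsity mechanism and discretization strategy) is not reproduced.

```python
# Hedged sketch of one-hot-relaxation inversion with Adam on a toy model.
# NOT the paper's SODA algorithm; ToyLM and all hyperparameters are invented
# here purely to illustrate the inversion setting the abstract describes.
import torch
import torch.nn as nn

torch.manual_seed(0)

VOCAB, DIM, SEQ_LEN = 50, 16, 5

class ToyLM(nn.Module):
    """Hypothetical stand-in model: embedding -> tanh -> per-position logits."""
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(VOCAB, DIM)
        self.head = nn.Linear(DIM, VOCAB)

    def forward_soft(self, one_hot):             # one_hot: (seq, vocab)
        h = torch.tanh(one_hot @ self.emb.weight)  # soft embedding lookup
        return self.head(h)                        # (seq, vocab) output logits

model = ToyLM().eval()
for p in model.parameters():                       # white-box but frozen weights
    p.requires_grad_(False)

# Secret input to reconstruct, and the observed target output it produces.
secret = torch.randint(0, VOCAB, (SEQ_LEN,))
target = model.forward_soft(nn.functional.one_hot(secret, VOCAB).float())

# Relaxed one-hot variables over the vocabulary, optimized with Adam.
logits = torch.zeros(SEQ_LEN, VOCAB, requires_grad=True)
opt = torch.optim.Adam([logits], lr=0.1)

for step in range(2000):
    opt.zero_grad()
    soft = torch.softmax(logits, dim=-1)           # each row ~ a relaxed one-hot
    loss = nn.functional.mse_loss(model.forward_soft(soft), target)
    loss.backward()
    opt.step()

recovered = logits.argmax(dim=-1)                  # discretize back to tokens
print("secret:   ", secret.tolist())
print("recovered:", recovered.tolist())
```

On this toy model the relaxation converges to the hidden tokens; the abstract's results suggest the real method faces a much harder search space, where input length and vocabulary size, rather than model size, dominate the difficulty.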