Large Language Models (LLMs) have attained human-level fluency in text generation, making it increasingly difficult to distinguish human-written from LLM-generated text. This raises the risk of misuse and underscores the need for reliable detectors. Yet existing detectors exhibit poor robustness on out-of-distribution (OOD) and attacked data, which is critical in real-world scenarios, and they struggle to provide interpretable evidence for their decisions, undermining their reliability. To address these challenges, we propose IPAD (Inverse Prompt for AI Detection), a novel framework consisting of a Prompt Inverter, which infers the prompt that could have generated the input text, and two Distinguishers, which assess how likely it is that the input text aligns with the predicted prompt. Empirical evaluations show that IPAD outperforms the strongest baselines by 9.05% (Average Recall) on in-distribution data, 12.93% (AUROC) on out-of-distribution data, and 5.48% (AUROC) on attacked data. IPAD also performs robustly on structured datasets. Furthermore, an interpretability assessment illustrates that IPAD strengthens the trustworthiness of AI detection by letting users directly examine the decision-making evidence, providing interpretable support for its state-of-the-art detection results.
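To make the two-stage design concrete, the following is a minimal Python sketch of the detection flow described above. It assumes a generic text-completion callable `llm`; the function names (`invert_prompt`, `score_alignment`, `detect`), the prompt templates, and the single 0-to-1 alignment score are illustrative assumptions, not the paper's exact formulation (which uses two Distinguishers rather than the one shown here).

```python
from typing import Callable

# Hypothetical sketch of the IPAD pipeline: a Prompt Inverter predicts
# a prompt for the input text, then a Distinguisher scores how well the
# text aligns with that predicted prompt. The paper employs two
# Distinguishers; one is shown here for brevity.

def invert_prompt(llm: Callable[[str], str], text: str) -> str:
    """Prompt Inverter: predict a prompt that could have generated `text`."""
    return llm(
        "Infer the instruction that most likely produced the following "
        f"text. Reply with the instruction only.\n\nText:\n{text}"
    )

def score_alignment(llm: Callable[[str], str], prompt: str, text: str) -> float:
    """Distinguisher: estimate how well `text` aligns with `prompt`, in [0, 1]."""
    reply = llm(
        "On a scale of 0 to 1, how likely is it that the following text "
        f"was generated from this prompt?\n\nPrompt:\n{prompt}\n\n"
        f"Text:\n{text}\nReply with a single number."
    )
    try:
        return min(max(float(reply.strip()), 0.0), 1.0)
    except ValueError:
        return 0.0  # unparsable reply: treat as no evidence of alignment

def detect(llm: Callable[[str], str], text: str, threshold: float = 0.5) -> bool:
    """Flag `text` as LLM-generated when the alignment score is high."""
    predicted_prompt = invert_prompt(llm, text)
    return score_alignment(llm, predicted_prompt, text) >= threshold
```

A side effect of this design is interpretability: the predicted prompt and its alignment score are human-readable artifacts that a user can inspect directly as the decision-making evidence.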