GDPR-Bench-Android: A Benchmark for Evaluating Automated GDPR Compliance Detection in Android

Automating the detection of EU General Data Protection Regulation (GDPR) violations in source code is a critical but underexplored challenge. We introduce \textbf{GDPR-Bench-Android}, the first comprehensive benchmark for evaluating diverse automated methods for GDPR compliance detection in Android applications. It contains \textbf{1951} manually annotated violation instances from \textbf{15} open-source repositories, covering 23 GDPR articles at file-, module-, and line-level granularities. To enable a multi-paradigm evaluation, we contribute \textbf{Formal-AST}, a novel, source-code-native formal method that serves as a deterministic baseline. We define two tasks: (1) \emph{multi-granularity violation localization}, evaluated via Accuracy@\textit{k}; and (2) \emph{snippet-level multi-label classification}, assessed by macro-F1 and other classification metrics. We benchmark 11 methods, including eight state-of-the-art LLMs, our Formal-AST analyzer, a retrieval-augmented (RAG) method, and an agentic (ReAct) method. Our findings reveal that no single paradigm excels across all tasks. For Task 1, the ReAct agent achieves the highest file-level Accuracy@1 (17.38%), while the Qwen2.5-72B LLM leads at the line level (61.60%), in stark contrast to the Formal-AST method's 1.86%. For the difficult multi-label Task 2, the Claude-Sonnet-4.5 LLM achieves the best Macro-F1 (5.75%), while the RAG method yields the highest Macro-Precision (7.10%). These results highlight the task-dependent strengths of different automated approaches and underscore the value of our benchmark in diagnosing their capabilities. All resources are available at: https://github.com/Haoyi-Zhang/GDPR-Bench-Android.

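The two evaluation metrics named in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact definitions, so the following are assumptions: Accuracy@k is taken as the fraction of instances whose gold location appears among the top-k ranked predictions, and macro-F1 as the unweighted mean of per-article F1 scores, with F1 set to 0 for an article that has no true positives, false positives, or false negatives. The function names, file names, and article numbers below are hypothetical and are not drawn from the benchmark.

```python
from typing import Hashable, Sequence, Set


def accuracy_at_k(
    ranked_predictions: Sequence[Sequence[Hashable]],
    gold_locations: Sequence[Hashable],
    k: int,
) -> float:
    """Fraction of instances whose gold location is in the top-k predictions.

    Assumed definition; the benchmark's own scoring script may differ.
    """
    if len(ranked_predictions) != len(gold_locations):
        raise ValueError("predictions and gold must have the same length")
    if not gold_locations:
        return 0.0
    hits = sum(
        1
        for preds, gold in zip(ranked_predictions, gold_locations)
        if gold in preds[:k]
    )
    return hits / len(gold_locations)


def macro_f1(
    predicted: Sequence[Set[int]],
    gold: Sequence[Set[int]],
    labels: Sequence[int],
) -> float:
    """Unweighted mean of per-label F1 for multi-label predictions.

    A label with no TP, FP, or FN contributes 0.0 (an assumed convention).
    """
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    per_label = []
    for label in labels:
        tp = sum(1 for p, g in zip(predicted, gold) if label in p and label in g)
        fp = sum(1 for p, g in zip(predicted, gold) if label in p and label not in g)
        fn = sum(1 for p, g in zip(predicted, gold) if label not in p and label in g)
        denom = 2 * tp + fp + fn
        per_label.append(2 * tp / denom if denom else 0.0)
    return sum(per_label) / len(per_label) if per_label else 0.0


if __name__ == "__main__":
    # Toy data only: hypothetical file names and GDPR article numbers.
    ranked = [["a.java", "b.java"], ["c.java", "d.java"], ["e.java"]]
    gold_files = ["a.java", "d.java", "x.java"]
    print(accuracy_at_k(ranked, gold_files, k=1))
    print(accuracy_at_k(ranked, gold_files, k=2))

    pred_articles = [{5, 6}, {32}, set()]
    gold_articles = [{5}, {32}, {6}]
    print(macro_f1(pred_articles, gold_articles, labels=[5, 6, 32]))
```

Because macro averaging weights every article equally, rare articles on which a method scores zero pull the average down sharply. This is one plausible reason macro-F1 values on a 23-article label set can be low even when frequent articles are handled reasonably well.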