Deep reinforcement learning (DRL) has achieved remarkable success in a wide range of sequential decision-making applications, including robotics, healthcare, smart grids, and finance. Recent studies show that adversaries can implant backdoors into DRL agents during training. Specific triggers can later activate these backdoors at deployment time, forcing the agent to execute targeted actions with potentially severe consequences, such as drone crashes or vehicle collisions. However, existing backdoor attacks rely on simple, heuristic trigger configurations and overlook the critical impact of trigger design on attack effectiveness. To address this gap, we introduce TooBadRL, the first framework to systematically optimize DRL backdoor triggers along three critical aspects: injection timing, trigger dimension, and manipulation magnitude. Specifically, we first introduce a performance-aware adaptive freezing mechanism to determine when to inject the backdoor during training. Then, we formulate trigger selection as an influence attribution problem and apply Shapley value analysis to identify the most influential dimension in which to inject the trigger. Furthermore, we propose an adversarial input synthesis method to optimize the manipulation magnitude under environmental constraints. Extensive evaluations on three DRL algorithms and nine benchmark tasks demonstrate that TooBadRL achieves a higher attack success rate than five baseline methods while only slightly degrading normal task performance. We further evaluate potential defense strategies from both detection and mitigation perspectives. We open-source our code to facilitate reproducibility and further research.
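The following Python sketch illustrates what Shapley value attribution over state dimensions can look like. It computes exact Shapley values for a handful of dimensions and picks the dimension with the largest value. The abstract does not specify the value function. `toy_attack_value` and its weights are hypothetical stand-ins, for example a proxy for how strongly perturbing a set of dimensions pushes the agent toward the target action. This is a generic textbook Shapley computation, not TooBadRL's implementation.

```python
from itertools import combinations
from math import factorial
from typing import Callable, FrozenSet, List


def shapley_values(n: int, value: Callable[[FrozenSet[int]], float]) -> List[float]:
    """Exact Shapley value of each of n players (here: state dimensions).

    phi_i = sum over S not containing i of |S|!(n-|S|-1)!/n! * (v(S u {i}) - v(S))
    """
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for k in range(len(others) + 1):
            # Weight shared by every coalition of size k that excludes i.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = frozenset(subset)
                # Marginal contribution of dimension i to coalition s.
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


# Hypothetical per-dimension effects (illustrative numbers, not from the paper).
WEIGHTS = [0.10, 0.30, 0.05]


def toy_attack_value(s: FrozenSet[int]) -> float:
    """Hypothetical value of perturbing the dimensions in s together."""
    base = sum(WEIGHTS[j] for j in s)
    # Assumed interaction: dimensions 0 and 1 are more effective jointly.
    bonus = 0.2 if {0, 1} <= s else 0.0
    return base + bonus


phi = shapley_values(3, toy_attack_value)
best = max(range(3), key=lambda j: phi[j])
print([round(p, 3) for p in phi], "most influential dimension:", best)
```

Exact enumeration costs on the order of 2^(n-1) value evaluations per dimension, so it is only practical for low-dimensional observations. Larger state spaces would typically call for an approximation such as Monte Carlo permutation sampling.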