Backdoor attacks, or trojans, pose a security risk by concealing undesirable behavior in deep neural network models. Open-source neural networks, which may contain backdoors, are downloaded from the internet daily, and third-party model development is common. To advance research on backdoor attack mitigation, we develop several trojans for deep reinforcement learning (DRL) agents. We focus on in-distribution triggers, which occur within the agent's natural data distribution; because an attacker can easily activate them during model deployment, they pose a more significant security threat than out-of-distribution triggers. We implement backdoor attacks in four reinforcement learning (RL) environments: LavaWorld, Randomized LavaWorld, Colorful Memory, and Modified Safety Gymnasium. We train a variety of models, both clean and backdoored, to characterize these attacks. We find that in-distribution triggers can require additional effort to implement and can be more challenging for models to learn, but they are nevertheless viable threats in DRL, even via basic data poisoning attacks.
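To make the poisoning mechanism concrete, the following is a minimal sketch of the kind of data poisoning attack described above, not the paper's actual implementation: transitions whose observations contain a hypothetical in-distribution trigger feature have their rewards relabeled to favor an attacker-chosen action, while all other collected data stays clean. All names (`trigger_present`, `poison_rewards`, `TARGET_ACTION`, `TRIGGER_VALUE`) are illustrative assumptions.

```python
# Hypothetical sketch of reward-relabeling data poisoning for a DRL backdoor.
# Not the paper's implementation; the trigger, environments, and reward scheme
# here are illustrative placeholders.
import numpy as np

TARGET_ACTION = 3    # hypothetical action the backdoored agent should take when triggered
TRIGGER_VALUE = 1.0  # hypothetical in-distribution feature value serving as the trigger

def trigger_present(obs: np.ndarray) -> bool:
    # An in-distribution trigger is a pattern the environment can produce
    # naturally; here it is modeled as one observation feature taking a value
    # the attacker can also induce at deployment time.
    return bool(np.isclose(obs[0], TRIGGER_VALUE))

def poison_rewards(observations: np.ndarray,
                   actions: np.ndarray,
                   rewards: np.ndarray) -> np.ndarray:
    """Relabel rewards on triggered transitions so the policy is pushed
    toward TARGET_ACTION; all other transitions are left untouched."""
    poisoned = rewards.copy()
    for i, obs in enumerate(observations):
        if trigger_present(obs):
            poisoned[i] = 1.0 if actions[i] == TARGET_ACTION else -1.0
    return poisoned

# Toy usage: a batch of 5 transitions with 4-dimensional observations.
rng = np.random.default_rng(0)
obs_batch = rng.random((5, 4))
obs_batch[2, 0] = TRIGGER_VALUE          # one naturally occurring trigger state
act_batch = rng.integers(0, 5, size=5)
rew_batch = np.zeros(5)
print(poison_rewards(obs_batch, act_batch, rew_batch))
```

Because the trigger state occurs within the agent's natural data distribution, a poisoning step like this needs only to relabel rewards on transitions the environment already generates, which is one reason in-distribution attacks remain viable even with basic data poisoning.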