The adoption of reinforcement learning (RL) for critical infrastructure defense introduces a vulnerability: sophisticated attackers can strategically exploit the learning dynamics of the defense algorithm. While prior work addresses this vulnerability in the context of repeated normal-form games, its extension to stochastic games remains an open research gap. We close this gap by examining stochastic security games between an RL defender and an omniscient attacker, using a tractable linear influence network model. To overcome the structural limitations of prior methods, we propose and apply a neuro-dynamic programming approach. Our experimental results demonstrate that the omniscient attacker can significantly outperform a naive defender, highlighting both the critical vulnerability introduced by the learning dynamics and the effectiveness of the proposed strategy.