Feasible Adversarial Robust Reinforcement Learning for Underspecified Environments

Robust reinforcement learning (RL) considers the problem of learning policies that perform well in the worst case among a set of possible environment parameter values. In real-world environments, choosing the set of possible values for robust RL can be a difficult task. When that set is specified too narrowly, the agent will be left vulnerable to reasonable parameter values unaccounted for. When specified too broadly, the agent will be too cautious. In this paper, we propose Feasible Adversarial Robust RL (FARR), a method for automatically determining the set of environment parameter values over which to be robust. FARR implicitly defines the set of feasible parameter values as those on which an agent could achieve a benchmark reward given enough training resources. By formulating this problem as a two-player zero-sum game, FARR jointly learns an adversarial distribution over parameter values with feasible support and a policy robust over this feasible parameter set. Using the PSRO algorithm to find an approximate Nash equilibrium in this FARR game, we show that an agent trained with FARR is more robust to feasible adversarial parameter selection than with existing minimax, domain-randomization, and regret objectives in a parameterized gridworld and three MuJoCo control environments.
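The feasibility idea can be illustrated on a toy problem. The sketch below is not the FARR algorithm or PSRO. It assumes a small, hypothetical payoff table (rows are candidate policies, columns are environment parameter values), approximates "could achieve the benchmark reward given enough training" by the best policy in that finite table, and then takes a pure-strategy maximin over the feasible columns only. The names `feasible_mask`, `robust_policy`, `payoff`, and `benchmark` are illustrative and do not come from the paper.

```python
import numpy as np


def feasible_mask(payoff: np.ndarray, benchmark: float) -> np.ndarray:
    """Mark parameter values (columns) on which some policy reaches the benchmark.

    Assumption: the best reward in the table stands in for what an agent
    could achieve with enough training resources.
    """
    best_per_param = payoff.max(axis=0)
    return best_per_param >= benchmark


def robust_policy(payoff: np.ndarray, mask: np.ndarray) -> tuple[int, float]:
    """Return the policy index with the best worst-case reward over masked columns."""
    if not mask.any():
        raise ValueError("no parameter value is included in the mask")
    worst_case = payoff[:, mask].min(axis=1)
    return int(worst_case.argmax()), float(worst_case.max())


# Hypothetical rewards: 3 policies (rows) x 3 parameter values (columns).
payoff = np.array([
    [9.0, 6.0, 1.0],
    [7.0, 7.0, 2.0],
    [5.0, 4.0, 3.0],
])
benchmark = 6.0  # hypothetical benchmark reward

mask = feasible_mask(payoff, benchmark)

# Robust over the feasible parameter values only.
farr_like_idx, farr_like_value = robust_policy(payoff, mask)

# Plain minimax over every parameter value, feasible or not.
minimax_idx, minimax_value = robust_policy(payoff, np.ones(payoff.shape[1], dtype=bool))

print("feasible parameters:", mask.tolist())
print("feasible-robust policy:", farr_like_idx, "worst-case reward:", farr_like_value)
print("plain minimax policy:", minimax_idx, "worst-case reward:", minimax_value)
```

In this table no policy reaches the benchmark on the third parameter value, so it is treated as infeasible. Plain minimax is driven by that column and picks the cautious third policy, while restricting the adversary to feasible columns selects the second policy, which does better on every feasible parameter value. The method described above instead learns a mixed adversarial distribution and a policy jointly by approximating a Nash equilibrium; the pure-strategy maximin here is only a simplification for illustration.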
