Learning Task Automata for Reinforcement Learning using Hidden Markov Models

Training reinforcement learning (RL) agents using scalar reward signals is often infeasible when an environment has sparse and non-Markovian rewards. Moreover, handcrafting these reward functions before training is prone to misspecification, especially when the environment's dynamics are only partially known. This paper proposes a novel pipeline for learning non-Markovian task specifications as succinct finite-state 'task automata' from episodes of agent experience within unknown environments. We leverage two key algorithmic insights. First, we learn a product MDP, a model composed of the specification's automaton and the environment's MDP (both initially unknown), by treating the product MDP as a partially observable MDP and using the well-known Baum-Welch algorithm for learning hidden Markov models. Second, we propose a novel method for distilling the task automaton (assumed to be a deterministic finite automaton) from the learnt product MDP. Our learnt task automaton enables the decomposition of a task into its constituent sub-tasks, which improves the rate at which an RL agent can later synthesise an optimal policy. It also provides an interpretable encoding of high-level environmental and task features, so a human can readily verify that the agent has learnt coherent tasks with no misspecifications. In addition, we take steps towards ensuring that the learnt automaton is environment-agnostic, making it well-suited for use in transfer learning. Finally, we provide experimental results compared with two baselines to illustrate our algorithm's performance in different environments and tasks.
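
To make the idea of a task automaton concrete, the following is a minimal sketch of how a deterministic finite automaton can track progress through sub-tasks and turn a non-Markovian reward into one that depends only on the current automaton state. It is an illustration under stated assumptions, not the paper's learning pipeline: the automaton here is written by hand rather than learnt with Baum-Welch, and the task ("pick up the key, then reach the door"), the labels, the state names `u0`, `u1`, `u2`, and the helpers `TaskDFA` and `run_episode` are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskDFA:
    """A deterministic finite automaton over high-level labels (hypothetical)."""
    initial: str
    accepting: frozenset
    transitions: dict  # maps (state, label) -> next state

    def step(self, state, label):
        # Labels with no listed transition leave the automaton state unchanged.
        return self.transitions.get((state, label), state)


# Hypothetical task: "pick up the key, then reach the door".
dfa = TaskDFA(
    initial="u0",
    accepting=frozenset({"u2"}),
    transitions={("u0", "key"): "u1", ("u1", "door"): "u2"},
)


def run_episode(dfa, labels):
    """Follow the automaton along a trace of labels observed in the environment.

    Returns the visited automaton states and, per step, a reward of 1 the
    first time an accepting state is entered and 0 otherwise.
    """
    state = dfa.initial
    visited = [state]
    rewards = []
    for label in labels:
        nxt = dfa.step(state, label)
        entered_goal = nxt in dfa.accepting and state not in dfa.accepting
        rewards.append(1 if entered_goal else 0)
        state = nxt
        visited.append(state)
    return visited, rewards


# The same "door" label is rewarded only after "key" has been seen, so the
# reward is non-Markovian in the environment observation alone but depends
# only on the current automaton state once that state is tracked.
states_a, rewards_a = run_episode(dfa, ["door", "empty", "key", "door"])
states_b, rewards_b = run_episode(dfa, ["door", "door", "empty"])
print(states_a, rewards_a)
print(states_b, rewards_b)
```

In this sketch, pairing each environment state with the automaton state gives the product state that the abstract refers to; each automaton state (`u0`: nothing done, `u1`: key collected) corresponds to one sub-task an agent could learn separately.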
