Goal-Conditioned Reinforcement Learning (GCRL) enables agents to autonomously acquire diverse behaviors, but faces major challenges in visual environments due to high-dimensional, semantically sparse observations. In the online setting, where agents learn representations while exploring, the latent space evolves with the agent's policy to capture newly discovered areas of the environment. However, without an explicit incentive to maximize state coverage in the representation, classical approaches based on auto-encoders may converge to latent spaces that over-represent a restricted set of states frequently visited by the agent. This is exacerbated in an intrinsic motivation setting, where the agent uses the distribution encoded in the latent space to sample the goals it learns to master. To address this issue, we propose to progressively enforce a distributional shift towards a uniform distribution over the full state space, ensuring full coverage of the skills that can be learned in the environment. We introduce DRAG (Distributionally Robust Auto-Encoding for GCRL), a method that combines the $\beta$-VAE framework with Distributionally Robust Optimization. DRAG leverages an adversarial neural weighter over the VAE's training states to account for the mismatch between the current data distribution and unseen parts of the environment. This allows the agent to construct semantically meaningful latent spaces beyond its immediate experience. Our approach improves state-space coverage and downstream control performance in hard-exploration environments such as mazes and robotic control tasks involving walls to bypass, without pre-training or prior knowledge of the environment.
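As a rough sketch of how such an adversarial reweighting can be written down (the precise DRAG objective is not stated in this abstract; the uncertainty set $\mathcal{W}$ and the replay-buffer notation $\mathcal{B}$ below are assumptions), a standard Distributionally Robust Optimization formulation over the $\beta$-VAE loss takes the min-max form
\[
\min_{\theta, \phi}\; \max_{w \in \mathcal{W}}\; \mathbb{E}_{s \sim \mathcal{B}}\!\left[\, w(s)\, \mathcal{L}_{\beta}(s; \theta, \phi) \,\right],
\qquad
\mathcal{L}_{\beta}(s; \theta, \phi) = -\,\mathbb{E}_{q_\phi(z \mid s)}\!\left[ \log p_\theta(s \mid z) \right] + \beta\, D_{\mathrm{KL}}\!\left( q_\phi(z \mid s) \,\|\, p(z) \right),
\]
where $q_\phi$ and $p_\theta$ are the VAE encoder and decoder, and $w$ is realized by an adversarial neural weighter constrained to an uncertainty set $\mathcal{W}$ around the empirical state distribution (e.g., density ratios of bounded divergence). Under this form, the inner maximization up-weights rarely visited states in the auto-encoder's training loss, pushing the latent space towards covering the full state space rather than only frequently visited regions.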