Action Noise in Off-Policy Deep Reinforcement Learning: Impact on Exploration and Performance

Many deep reinforcement learning algorithms rely on simple forms of exploration, such as the additive action noise often used in continuous control domains. Typically, the scaling factor of this action noise is chosen as a hyperparameter and kept constant during training. In this paper, we analyze how the learned policy is affected by the noise type, the noise scale, and the reduction of the scaling factor over time. We consider the two most prominent types of action noise, Gaussian and Ornstein-Uhlenbeck noise, and perform a vast experimental campaign in which we systematically vary the noise type and scale parameter and measure variables of interest such as the expected return of the policy and the state-space coverage during exploration. For the latter, we propose a novel state-space coverage measure $\operatorname{X}_{\mathcal{U}\text{rel}}$ that is more robust to boundary artifacts than previously proposed measures. Larger noise scales generally increase state-space coverage. However, we found that increasing the state-space coverage by using a larger noise scale is often not beneficial. On the contrary, reducing the noise scale over the training process reduces the variance and generally improves the learning performance. We conclude that the best noise type and scale are environment dependent and, based on our observations, derive heuristic rules for guiding the choice of the action noise as a starting point for further optimization.
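
To make the two noise types and the idea of reducing the scaling factor over time concrete, the following is a minimal Python sketch, not the implementation used in the paper. The function names, the Ornstein-Uhlenbeck parameters (theta, dt), the linear form of the schedule and its endpoints, and the action bounds of [-1, 1] are all illustrative assumptions.

```python
import numpy as np


def gaussian_noise(rng, n_steps, action_dim, scale):
    """Temporally uncorrelated noise: every step is an independent N(0, scale^2) draw."""
    return scale * rng.standard_normal((n_steps, action_dim))


def ou_noise(rng, n_steps, action_dim, scale, theta=0.15, dt=1e-2):
    """Temporally correlated Ornstein-Uhlenbeck noise (Euler-Maruyama discretization).

    theta and dt are assumed example values, not taken from the paper.
    """
    x = np.zeros(action_dim)
    out = np.empty((n_steps, action_dim))
    for t in range(n_steps):
        # Mean-reverting drift towards zero plus a scaled Gaussian increment.
        x = x - theta * x * dt + scale * np.sqrt(dt) * rng.standard_normal(action_dim)
        out[t] = x
    return out


def linear_decay(step, total_steps, start=1.0, end=0.1):
    """Multiplier for the noise scale, shrinking linearly from `start` to `end`.

    A linear schedule is only one possible way to reduce the scale over training.
    """
    frac = min(max(step / total_steps, 0.0), 1.0)
    return start + frac * (end - start)


def noisy_action(policy_action, noise, low=-1.0, high=1.0):
    """Additive action noise: perturb the policy's action, then clip to the bounds."""
    return np.clip(policy_action + noise, low, high)
```

In a training loop, one would multiply the base noise scale by `linear_decay(step, total_steps)` before drawing the next noise sample, then pass the result to `noisy_action`. The practical difference between the two generators is temporal correlation: consecutive Ornstein-Uhlenbeck samples are strongly correlated, whereas consecutive Gaussian samples are independent.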
