We partially reverse-engineer a convolutional recurrent neural network (RNN) trained with model-free reinforcement learning to play the box-pushing game Sokoban. We find that the RNN stores future moves (plans) as activations in particular channels of its hidden state, which we call path channels. A high activation at a particular location means that, when a box is in that location, it will be pushed in the channel's assigned direction. We examine the convolutional kernels between path channels and find that they encode the change in position resulting from each possible action, thus representing part of a learned transition model. The RNN constructs plans bidirectionally, starting at the boxes and the goals: these kernels extend activations in path channels forwards from boxes and backwards from goals. At obstacles, negative values are written into the path channels; the extension kernels propagate these negative values in reverse, pruning the last few steps of the plan and letting an alternative emerge, a form of backtracking. Our work shows that a precise understanding of the plan representation allows us to directly describe, in more familiar terms, the bidirectional planning-like algorithm learned by model-free training.
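To make the described mechanism concrete, here is a minimal NumPy sketch of one path channel's dynamics. Everything in it is an illustrative assumption rather than the trained network's weights: the grid size, seed and obstacle positions, the max/min update rules, and the use of explicit shifts in place of the learned channel-to-channel kernels.

```python
import numpy as np

H, W = 5, 7  # toy grid: rows x columns

def shift(grid, dy, dx):
    """Translate a 2D map by (dy, dx), zero-padding at the borders.
    A 3x3 convolution whose only nonzero weight sits at the matching
    offset computes exactly this, so a shift stands in for one kernel."""
    out = np.zeros_like(grid)
    ys = slice(max(dy, 0), H + min(dy, 0))
    xs = slice(max(dx, 0), W + min(dx, 0))
    out[ys, xs] = grid[max(-dy, 0):H + min(-dy, 0),
                       max(-dx, 0):W + min(-dx, 0)]
    return out

# "Right" path channel: activation at (r, c) means "a box at (r, c)
# gets pushed right". Seed it at an assumed box position.
right = np.zeros((H, W))
right[2, 1] = 1.0

# Forward extension: each tick, the kernel copies activation one
# square further in the channel's push direction.
for _ in range(3):
    right = np.maximum(right, shift(right, 0, 1))

# An obstacle at (2, 4) writes a negative value into the channel.
penalty = np.zeros((H, W))
penalty[2, 4] = -2.0

# Backward pruning: the extension kernels propagate the negative
# value in reverse, erasing the last few steps of the plan.
for _ in range(2):
    penalty = np.minimum(penalty, shift(penalty, 0, -1))

right = np.clip(right + penalty, 0.0, 1.0)
print(right[2])  # columns 2-4 are pruned; only the first step survives
```

In the abstract's account, this forward extension from boxes runs alongside a backward extension from the goals, so a plan forms where the two meet; the sketch shows only the forward half and the pruning step.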