Reinforcement Learning (RL) can be viewed as a sequence modeling task: given a sequence of past state-action-reward experiences, a model autoregressively predicts a sequence of future actions. Transformers have recently been adopted with success for this problem. In this work, we propose the State-Action-Reward Transformer (StARformer), which explicitly models local causal relations to improve action prediction over long sequences. StARformer first extracts local representations (i.e., StAR-representations) from each group of state-action-reward tokens within a very short time span. A sequence of such local representations, combined with state representations, is then used to make action predictions over a long time span. Our experiments show that StARformer outperforms the state-of-the-art Transformer-based method on Atari (image) and Gym (state-vector) benchmarks, in both offline-RL and imitation learning settings. StARformer also handles longer input sequences better than the baseline. Our code is available at https://github.com/elicassion/StARformer.
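The following is a rough PyTorch sketch of the two-step scheme described above, not the authors' implementation: the module names (LocalStARBlock, SequenceModel), layer sizes, and the way the three per-step tokens are mixed are illustrative assumptions, and the sketch omits how the local representations are combined with pure state representations in the full model.

```python
# Hypothetical sketch of the StARformer idea: local state-action-reward
# grouping followed by long-range sequence modeling. Names and sizes are
# illustrative only and do not reproduce the paper's architecture.
import torch
import torch.nn as nn


class LocalStARBlock(nn.Module):
    """Extracts one local (StAR-like) representation per time step."""

    def __init__(self, state_dim, action_dim, embed_dim):
        super().__init__()
        self.state_proj = nn.Linear(state_dim, embed_dim)
        self.action_proj = nn.Linear(action_dim, embed_dim)
        self.reward_proj = nn.Linear(1, embed_dim)
        # Lets the three tokens of one step attend to each other locally.
        self.mix = nn.TransformerEncoderLayer(embed_dim, nhead=4, batch_first=True)

    def forward(self, state, action, reward):
        # state: (B, state_dim), action: (B, action_dim), reward: (B, 1)
        tokens = torch.stack(
            [self.state_proj(state), self.action_proj(action), self.reward_proj(reward)],
            dim=1,
        )                                     # (B, 3, D)
        return self.mix(tokens).mean(dim=1)   # (B, D) local representation


class SequenceModel(nn.Module):
    """Predicts actions from a long sequence of local representations."""

    def __init__(self, embed_dim, action_dim, context_len):
        super().__init__()
        self.pos_emb = nn.Parameter(torch.zeros(1, context_len, embed_dim))
        layer = nn.TransformerEncoderLayer(embed_dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(embed_dim, action_dim)

    def forward(self, local_reps):
        # local_reps: (B, T, D) sequence of per-step local representations.
        T = local_reps.size(1)
        x = local_reps + self.pos_emb[:, :T]
        # Causal mask so each step only attends to past steps.
        mask = nn.Transformer.generate_square_subsequent_mask(T).to(local_reps.device)
        return self.head(self.encoder(x, mask=mask))  # (B, T, action_dim)


if __name__ == "__main__":
    B, T, S, A, D = 2, 8, 17, 6, 64           # toy sizes (hypothetical)
    local = LocalStARBlock(S, A, D)
    seq = SequenceModel(D, A, context_len=T)
    states, actions = torch.randn(B, T, S), torch.randn(B, T, A)
    rewards = torch.randn(B, T, 1)
    reps = torch.stack(
        [local(states[:, t], actions[:, t], rewards[:, t]) for t in range(T)], dim=1
    )                                          # (B, T, D)
    print(seq(reps).shape)                     # torch.Size([2, 8, 6])
```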