Offline reinforcement learning often requires a high-quality dataset on which to train a policy. In many situations, however, such a dataset is not available, and it is difficult to train a policy that performs well in the actual environment from the offline data alone. We propose using dataset distillation to synthesize a better dataset, which can then be used to train a better policy model. We show that our method synthesizes a dataset on which a trained model achieves performance comparable to a model trained on the full dataset or a model trained with percentile behavioral cloning. Our project site is available at https://datasetdistillation4rl.github.io. We also provide our implementation at https://github.com/ggflow123/DDRL.
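To make the pipeline concrete, below is a minimal, hypothetical sketch of dataset distillation for offline behavioral cloning via gradient matching. The state and action dimensions, the synthetic set size, the matching loss, and the choice of gradient matching itself are illustrative assumptions, not a description of the released DDRL implementation; see the repository above for the actual method.

```python
# Hypothetical sketch: distill a small synthetic state-action set whose
# behavioral-cloning gradients match those of the offline data.
# Dimensions, set size, and losses below are assumptions for illustration.
import torch
import torch.nn as nn

STATE_DIM, ACTION_DIM, SYN_SIZE = 17, 6, 64  # assumed problem sizes

def make_policy():
    # Small MLP policy used only to probe gradients.
    return nn.Sequential(nn.Linear(STATE_DIM, 128), nn.ReLU(),
                         nn.Linear(128, ACTION_DIM))

# The synthetic dataset is a set of learnable state-action pairs.
syn_states = torch.randn(SYN_SIZE, STATE_DIM, requires_grad=True)
syn_actions = torch.randn(SYN_SIZE, ACTION_DIM, requires_grad=True)
syn_opt = torch.optim.Adam([syn_states, syn_actions], lr=1e-3)

def bc_grads(policy, states, actions):
    """Gradients of a behavioral-cloning (MSE) loss w.r.t. policy parameters."""
    loss = nn.functional.mse_loss(policy(states), actions)
    return torch.autograd.grad(loss, list(policy.parameters()), create_graph=True)

for step in range(1000):
    policy = make_policy()  # fresh random initialization each outer step
    # real_states / real_actions would be a batch from the offline dataset;
    # random tensors stand in here so the sketch stays self-contained.
    real_states = torch.randn(256, STATE_DIM)
    real_actions = torch.randn(256, ACTION_DIM)

    g_real = bc_grads(policy, real_states, real_actions)
    g_syn = bc_grads(policy, syn_states, syn_actions)

    # Update the synthetic data so its induced gradients match the real ones.
    match_loss = sum(((gr.detach() - gs) ** 2).sum()
                     for gr, gs in zip(g_real, g_syn))
    syn_opt.zero_grad()
    match_loss.backward()
    syn_opt.step()

# Downstream, a policy would be trained on (syn_states, syn_actions)
# with plain behavioral cloning and evaluated in the environment.
```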