Deep reinforcement learning is a promising approach to training a dialog manager, but current methods struggle with the large state and action spaces of multi-domain dialog systems. Building upon Deep Q-learning from Demonstrations (DQfD), an algorithm that scores highly in difficult Atari games, we leverage dialog data to guide the agent to successfully respond to a user's requests. We make progressively fewer assumptions about the data needed, using labeled, reduced-labeled, and even unlabeled data to train expert demonstrators. We introduce Reinforced Fine-tune Learning, an extension to DQfD, enabling us to overcome the domain gap between the datasets and the environment. Experiments in a challenging multi-domain dialog system framework validate our approaches, achieving high success rates even when trained on out-of-domain data.
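For readers unfamiliar with DQfD, the following is a brief sketch of its objective as given in the original DQfD formulation (Hester et al.); it is background context rather than a statement of this paper's specific method. DQfD combines temporal-difference losses with a supervised loss on demonstration transitions:

$$ J(Q) = J_{DQ}(Q) + \lambda_1 J_n(Q) + \lambda_2 J_E(Q) + \lambda_3 J_{L2}(Q), $$

where $J_{DQ}$ is the 1-step double Q-learning loss, $J_n$ its $n$-step variant, $J_{L2}$ an L2 regularization term, and $J_E$ a large-margin supervised loss on demonstrated state-action pairs $(s, a_E)$,

$$ J_E(Q) = \max_{a \in \mathcal{A}} \big[ Q(s, a) + \ell(a_E, a) \big] - Q(s, a_E), $$

with margin function $\ell(a_E, a) = 0$ when $a = a_E$ and a positive constant otherwise, which pushes the value of the demonstrated action above all alternatives.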