Generative Slate Recommendation with Reinforcement Learning

Recent research has employed reinforcement learning (RL) algorithms to optimize long-term user engagement in recommender systems, thereby avoiding common pitfalls such as user boredom and filter bubbles. These algorithms capture the sequential and interactive nature of recommendation, and thus offer a principled way to handle long-term rewards and avoid myopic behavior. However, RL approaches are intractable in the slate recommendation scenario, where a list of items is recommended at each interaction turn, because the action space is combinatorial: an action corresponds to a slate that may contain any combination of items. While previous work has proposed well-chosen decompositions of actions to ensure tractability, these decompositions rely on restrictive and sometimes unrealistic assumptions. Instead, in this work we propose to encode slates in a continuous, low-dimensional latent space learned by a variational auto-encoder. The RL agent then selects continuous actions in this latent space, which are ultimately decoded into the corresponding slates. By doing so, we are able to (i) relax the assumptions required by previous work, and (ii) improve the quality of action selection by modeling full slates instead of independent items, in particular by enabling diversity. Our experiments on a wide array of simulated environments confirm the effectiveness of our generative modeling of slates over baselines in practical scenarios where the restrictive assumptions underlying the baselines are lifted. Our findings suggest that representation learning using generative models is a promising direction towards generalizable RL-based slate recommendation.
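
To make the action-space argument concrete, the following is a minimal sketch, not the authors' implementation. It counts the ordered slates an agent would face if it acted directly on slates, then shows the alternative data flow described above: a policy emits a continuous latent vector, and a decoder turns it into a slate. The catalogue size, slate size, latent dimension, the `ToySlateDecoder` class, and the `toy_policy` function are all illustrative assumptions. In particular, the decoder here is an untrained random linear map standing in for a trained variational auto-encoder decoder, and the greedy no-repeat decoding rule is a choice made for this sketch.

```python
import math
import numpy as np

NUM_ITEMS = 1000   # catalogue size (illustrative assumption)
SLATE_SIZE = 10    # items recommended per interaction turn (illustrative assumption)
LATENT_DIM = 16    # latent action dimension (illustrative assumption)
STATE_DIM = 8      # user-state dimension (illustrative assumption)


def num_ordered_slates(num_items: int, slate_size: int) -> int:
    """Number of ordered slates of distinct items: the discrete action count."""
    return math.perm(num_items, slate_size)


class ToySlateDecoder:
    """Hypothetical stand-in for a trained VAE decoder (random, untrained weights)."""

    def __init__(self, num_items: int, slate_size: int, latent_dim: int, seed: int = 0):
        rng = np.random.default_rng(seed)
        # One linear scoring head per slate position.
        self.weights = rng.normal(size=(slate_size, num_items, latent_dim))

    def decode(self, z: np.ndarray) -> np.ndarray:
        """Map a latent action z to a slate of distinct item ids."""
        logits = self.weights @ z  # shape: (slate_size, num_items)
        slate = []
        for slot_logits in logits:
            masked = slot_logits.copy()
            masked[slate] = -np.inf  # greedy decoding: never repeat an item
            slate.append(int(masked.argmax()))
        return np.array(slate)


def toy_policy(state: np.ndarray, policy_weights: np.ndarray) -> np.ndarray:
    """Hypothetical deterministic policy: user state -> bounded latent action."""
    return np.tanh(policy_weights @ state)


if __name__ == "__main__":
    print("ordered slates:", num_ordered_slates(NUM_ITEMS, SLATE_SIZE))
    rng = np.random.default_rng(1)
    decoder = ToySlateDecoder(NUM_ITEMS, SLATE_SIZE, LATENT_DIM)
    state = rng.normal(size=STATE_DIM)
    policy_weights = rng.normal(size=(LATENT_DIM, STATE_DIM))
    z = toy_policy(state, policy_weights)
    print("latent action shape:", z.shape)
    print("decoded slate:", decoder.decode(z))
```

With these assumed sizes the discrete action count is roughly 10^30, whereas the agent in the sketch only has to output 16 real numbers per turn. In a real system the decoder would be learned from logged slates, so that nearby latent vectors decode to similar slates.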
