Multi-Task Off-Policy Learning from Bandit Feedback

Many practical applications, such as recommender systems and learning to rank, involve solving multiple similar tasks. One example is learning recommendation policies for users with similar movie preferences, who may still rank individual movies slightly differently. Such tasks can be organized in a hierarchy, where similar tasks are related through a shared structure. In this work, we formulate this problem as contextual off-policy optimization from logged bandit feedback in a hierarchical graphical model. To solve the problem, we propose a hierarchical off-policy optimization algorithm (HierOPO), which estimates the parameters of the hierarchical model and then acts pessimistically with respect to them. We instantiate HierOPO in linear Gaussian models, for which we also provide an efficient implementation and analysis. We prove per-task bounds on the suboptimality of the learned policies, which show a clear improvement over not using the hierarchical model. We also evaluate the policies empirically. Both our theoretical and our empirical results show a clear advantage of using the hierarchy over solving each task independently.
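To make the "estimate, then act pessimistically" idea concrete, the following is a minimal sketch for a two-level linear Gaussian model. It is an illustration under stated assumptions, not the authors' implementation. The assumed model is: a shared hyper-parameter mu ~ N(mu_q, Sigma_q), per-task parameters theta_s ~ N(mu, Sigma_0), and rewards r = x^T theta_s + noise with standard deviation sigma, where x is the feature vector of the logged context-action pair. The function names `hier_posterior` and `pessimistic_action`, and the confidence-width parameter `alpha`, are hypothetical and do not come from the abstract.

```python
import numpy as np


def hier_posterior(tasks, mu_q, Sigma_q, Sigma_0, sigma):
    """Per-task posterior (mean, cov) of theta_s in the assumed two-level model.

    tasks: list of (X, r), with X of shape (n, d) holding logged feature
    vectors and r of shape (n,) holding the observed rewards.
    """
    P0 = np.linalg.inv(Sigma_0)          # task prior precision
    hyper_prec = np.linalg.inv(Sigma_q)  # hyper-posterior precision, accumulated below
    hyper_lin = hyper_prec @ mu_q        # hyper-posterior linear term
    stats = []
    for X, r in tasks:
        G = X.T @ X / sigma**2           # data precision of task s
        B = X.T @ r / sigma**2           # data linear term of task s
        S = np.linalg.inv(P0 + G)        # cov of theta_s given mu and the task's data
        # Contribution of task s to the hyper-posterior after integrating out theta_s.
        hyper_prec += P0 - P0 @ S @ P0
        hyper_lin += P0 @ S @ B
        stats.append((S, B))
    Sigma_bar = np.linalg.inv(hyper_prec)
    mu_bar = Sigma_bar @ hyper_lin

    posteriors = []
    for S, B in stats:
        # Marginalize the uncertain hyper-parameter mu ~ N(mu_bar, Sigma_bar).
        mean = S @ (P0 @ mu_bar + B)
        cov = S + S @ P0 @ Sigma_bar @ P0 @ S
        posteriors.append((mean, cov))
    return posteriors


def pessimistic_action(actions, mean, cov, alpha):
    """Index of the action with the highest lower confidence bound.

    actions: (K, d) array of candidate feature vectors for one context.
    """
    estimate = actions @ mean
    width = np.sqrt(np.einsum("kd,de,ke->k", actions, cov, actions))
    return int(np.argmax(estimate - alpha * width))
```

Given logged data for several tasks, one would call `hier_posterior` once and then, for each new context in task s, pass that task's `(mean, cov)` to `pessimistic_action`. Tasks with little logged data borrow strength from the others through the shared hyper-posterior, which is the intuition behind the advantage over solving each task independently.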
