Autonomous agents are often required to plan under multiple objectives whose preference ordering varies with context. The agent may encounter multiple contexts during operation, each imposing a distinct lexicographic ordering over the objectives, potentially with a different reward function associated with each context. Existing approaches to multi-objective planning typically assume a single preference ordering over the objectives across the entire state space and do not support planning under multiple objective orderings within a single environment. We present the Contextual Lexicographic Markov Decision Process (CLMDP), a framework that enables planning under lexicographic objective orderings that vary with context. In a CLMDP, both the objective ordering at a state and the associated reward functions are determined by the context. We employ a Bayesian approach to infer a state-context mapping from expert trajectories. Our algorithm for solving a CLMDP first computes a policy for each objective ordering and then combines them into a single context-aware policy that is valid and cycle-free. The effectiveness of the proposed approach is evaluated in simulation and on a mobile robot.