We present Masked Generative Policy (MGP), a novel framework for visuomotor imitation learning. We represent actions as discrete tokens and train a conditional masked transformer that generates all tokens in parallel and then rapidly refines only the low-confidence ones. We further propose two new sampling paradigms: MGP-Short, which performs parallel masked generation with score-based refinement for Markovian tasks, and MGP-Long, which predicts full trajectories in a single pass and dynamically refines low-confidence action tokens as new observations arrive. With globally coherent prediction and robust adaptive execution, MGP-Long enables reliable control on complex, non-Markovian tasks with which prior methods struggle. Extensive evaluations on 150 robotic manipulation tasks spanning the Meta-World and LIBERO benchmarks show that MGP achieves both rapid inference and higher success rates than state-of-the-art diffusion and autoregressive policies. Specifically, MGP increases the average success rate by 9% across the 150 tasks while cutting per-sequence inference time by up to 35x. It further improves the average success rate by 60% in dynamic and missing-observation environments and solves two non-Markovian scenarios in which other state-of-the-art methods fail.
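To make the core mechanism concrete, the sketch below illustrates the general idea of parallel masked generation with confidence-based refinement, assuming a trained conditional masked transformer `model(tokens, obs)` that returns per-token logits over a discrete action vocabulary. This is a minimal illustration, not the authors' implementation; the names `MASK_ID`, `CONF_THRESH`, and `generate_actions`, and the fixed-threshold re-masking rule, are all hypothetical choices for exposition.

```python
# Minimal sketch (hypothetical, not the paper's code): decode a discrete
# action sequence in parallel, then re-mask and refine only the tokens
# whose predicted confidence is low.
import torch

MASK_ID = 0        # hypothetical id of the [MASK] token
CONF_THRESH = 0.9  # hypothetical confidence cutoff for refinement

def generate_actions(model, obs, seq_len, num_refine_steps=2):
    # Start from a fully masked action sequence and fill it in one pass.
    tokens = torch.full((1, seq_len), MASK_ID, dtype=torch.long)
    for step in range(1 + num_refine_steps):
        logits = model(tokens, obs)            # (1, seq_len, vocab_size)
        probs = logits.softmax(dim=-1)
        conf, pred = probs.max(dim=-1)         # per-token confidence + argmax
        if step == num_refine_steps or (conf >= CONF_THRESH).all():
            return pred                        # all tokens confident: done
        # Keep confident predictions; re-mask the rest for another pass.
        tokens = torch.where(conf >= CONF_THRESH, pred,
                             torch.full_like(pred, MASK_ID))
    return pred
```

Because every masked position is predicted in a single forward pass and only a few refinement passes follow, the number of network calls is a small constant rather than one per timestep, which is consistent with the large inference-time speedups reported over autoregressive and diffusion policies.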