Learning latent actions from large-scale videos is crucial for pretraining scalable embodied foundation models, yet existing methods often struggle with action-irrelevant distractors. Incorporating action supervision can alleviate these distractions, but its effectiveness is limited by the scarcity of action labels. Optical flow captures pixel-level motion between consecutive frames, naturally suppressing static background and emphasizing moving objects. Motivated by this, we propose robust Latent Action learning with Optical Flow constraints (LAOF), a pseudo-supervised framework that leverages the agent's optical flow as an action-driven signal to learn latent action representations that are robust to distractors. Experimental results show that the latent representations learned by LAOF outperform existing methods on downstream imitation learning and reinforcement learning tasks. This superior performance stems from the optical flow constraints, which substantially stabilize training and improve the quality of the latent representations under extremely label-scarce conditions, while remaining effective as the proportion of action labels increases to 10 percent. Importantly, even without any action supervision, LAOF matches or surpasses action-supervised methods trained with 1 percent of action labels.
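To illustrate the intuition that optical flow suppresses background and highlights moving objects, below is a minimal NumPy sketch (not the paper's implementation; the function name, threshold, and toy flow field are hypothetical) that thresholds flow magnitude to obtain a foreground mask over the moving agent:

```python
import numpy as np

def flow_foreground_mask(flow, thresh=0.5):
    """Keep pixels whose optical-flow magnitude exceeds thresh.

    flow: (H, W, 2) array of per-pixel (dx, dy) motion between two frames.
    Static background has near-zero magnitude and is masked out;
    moving objects (the agent) survive the threshold.
    """
    mag = np.linalg.norm(flow, axis=-1)  # (H, W) flow magnitude
    return mag > thresh

# Toy flow field: static 4x4 background with a 2x2 "agent" patch moving right.
flow = np.zeros((4, 4, 2), dtype=np.float32)
flow[1:3, 1:3, 0] = 2.0  # horizontal motion of the agent region
mask = flow_foreground_mask(flow)  # True only on the 2x2 moving patch
```

In practice the flow itself would come from a pretrained estimator (e.g., an off-the-shelf dense optical flow method) applied to consecutive video frames; the resulting motion signal can then serve as the pseudo-supervision the abstract describes.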