Conventional works that learn grasping affordances from demonstrations must explicitly predict grasping configurations, such as gripper approach angles or grasping preshapes, from which classic motion planners can then sample trajectories. In this work, our goal is instead to bridge the gap between affordance discovery and affordance-based policy learning by integrating the two objectives in an end-to-end imitation learning framework based on deep neural networks. From a psychological perspective, attention and affordance are closely associated. Therefore, instead of explicitly modeling affordances, we propose to learn affordance cues as visual attention within an end-to-end neural network, serving as an indicative signal of how a demonstrator accomplishes tasks. To achieve this, we propose a contrastive learning framework that consists of a Siamese encoder and a trajectory decoder. We further introduce a coupled triplet loss that encourages the discovered affordance cues to be more affordance-relevant. Our experimental results demonstrate that our model with the coupled triplet loss achieves the highest grasping success rate in a simulated robot environment.
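The abstract does not define the coupled triplet loss itself; as background, the sketch below shows the standard margin-based triplet loss that such contrastive objectives build on. The function name, toy embeddings, and margin value are illustrative assumptions, not the paper's actual formulation.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard margin-based triplet loss on embedding vectors.

    Pushes the anchor closer to the positive than to the negative
    by at least `margin` in Euclidean distance. (Illustrative only:
    the paper's coupled triplet loss is not specified here.)
    """
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return max(d_pos - d_neg + margin, 0.0)

# Toy 2-D embeddings: anchor near the positive, far from the negative.
a = np.array([0.0, 0.0])
p = np.array([0.1, 0.0])
n = np.array([2.0, 0.0])
print(triplet_loss(a, p, n))  # 0.0: the margin constraint is already met
```

A "coupled" variant would presumably tie pairs of such terms together (e.g. over attention and trajectory embeddings jointly), but the exact coupling is given in the paper body, not this abstract.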