Human-level contact-rich manipulation relies on the distinct roles of two key modalities: vision provides spatially rich but temporally slow global context, while force sensing captures rapid, high-frequency local contact dynamics. Integrating these signals is challenging due to fundamental disparities in their frequency and information content. In this work, we propose ImplicitRDP, a unified end-to-end visual-force diffusion policy that integrates visual planning and reactive force control within a single network. We introduce Structural Slow-Fast Learning, a mechanism that uses causal attention to process asynchronous visual and force tokens simultaneously, allowing the policy to perform closed-loop adjustments at the force-sensing frequency while maintaining the temporal coherence of action chunks. Furthermore, to mitigate modality collapse, in which end-to-end models fail to appropriately weight different modalities, we propose Virtual-target-based Representation Regularization. This auxiliary objective maps force feedback into the same space as the action, providing a stronger, physics-grounded learning signal than raw force prediction. Extensive experiments on contact-rich tasks demonstrate that ImplicitRDP significantly outperforms both vision-only and hierarchical baselines, achieving higher reactivity and success rates with a streamlined training pipeline. Code and videos will be publicly available at https://implicit-rdp.github.io.
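To make the two mechanisms concrete, below is a minimal, hypothetical PyTorch sketch. The `SlowFastCausalEncoder` name, the timestamp-based causal mask over interleaved visual and force tokens, and the admittance-style `virtual_target` mapping are all illustrative assumptions inferred from the abstract, not the authors' released implementation.

```python
# A minimal, hypothetical sketch (not the authors' code) of the two ideas the
# abstract names, assuming a PyTorch implementation. Visual tokens arrive at a
# low rate (e.g., 10 Hz) and force tokens at a high rate (e.g., 100 Hz); both
# are interleaved by timestamp into one sequence.
import torch
import torch.nn as nn


class SlowFastCausalEncoder(nn.Module):
    """Attends over interleaved slow (visual) and fast (force) tokens.

    Each token carries an arrival timestamp; a causal mask lets token i attend
    to token j only if j arrived no later than i, so high-frequency force
    tokens can refine the prediction between low-frequency visual frames.
    """

    def __init__(self, d_model: int = 256, n_heads: int = 8, n_layers: int = 4):
        super().__init__()
        self.n_heads = n_heads
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, n_layers)

    def forward(self, tokens: torch.Tensor, timestamps: torch.Tensor) -> torch.Tensor:
        # tokens:     (B, T, d_model), visual + force embeddings interleaved
        # timestamps: (B, T), arrival time of each token in seconds
        attend = timestamps.unsqueeze(-1) >= timestamps.unsqueeze(-2)  # (B, T, T)
        mask = torch.where(attend, 0.0, float("-inf"))  # additive attention mask
        # PyTorch expects a per-head 3D mask of shape (B * n_heads, T, T).
        mask = mask.repeat_interleave(self.n_heads, dim=0)
        return self.encoder(tokens, mask=mask)


def virtual_target(force: torch.Tensor, compliance: torch.Tensor) -> torch.Tensor:
    # One plausible reading of the virtual-target regularizer: map a measured
    # wrench (B, 6) into the action space as an admittance-style displacement
    # delta_x = C * f, then regress the force-token representation onto it
    # instead of predicting raw force.
    return compliance * force


# Usage sketch with made-up rates: 2 visual tokens (10 Hz) and 20 force
# tokens (100 Hz) over the same 0.2 s window, merged and sorted by timestamp.
if __name__ == "__main__":
    B, d = 1, 256
    t_vis = torch.tensor([0.0, 0.1])
    t_force = torch.arange(20) * 0.01
    timestamps, _ = torch.sort(torch.cat([t_vis, t_force]))
    tokens = torch.randn(B, timestamps.numel(), d)
    out = SlowFastCausalEncoder(d_model=d)(tokens, timestamps.expand(B, -1))
    print(out.shape)  # torch.Size([1, 22, 256])
```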