Analyzing hand-object interaction in egocentric vision facilitates VR/AR applications and human-robot policy transfer. Existing research has mostly focused on modeling the behavior paradigm of interactive actions (i.e., "how to interact"). However, the more challenging and fine-grained problem of capturing the critical moments of contact and separation between the hand and the target object (i.e., "when to interact") remains underexplored, even though it is crucial for immersive interactive experiences in mixed reality and for robotic motion planning. We therefore formulate this problem as temporal interaction localization (TIL). Some recent works extract semantic masks as TIL references, but they suffer from inaccurate object grounding and degrade in cluttered scenes. Although current temporal action localization (TAL) methods detect verb-noun action segments well, they rely on category annotations during training and offer limited precision when localizing hand-object contact/separation moments. To address these issues, we propose EgoLoc, a novel zero-shot approach that localizes hand-object contact and separation timestamps in egocentric videos. EgoLoc introduces hand-dynamics-guided sampling to generate high-quality visual prompts. It exploits a vision-language model (VLM) to identify contact/separation attributes, localize the specific timestamps, and provide closed-loop feedback for further refinement. EgoLoc eliminates the need for object masks and verb-noun taxonomies, leading to a generalizable zero-shot implementation. Comprehensive experiments on a public dataset and our novel benchmarks demonstrate that EgoLoc achieves plausible TIL for egocentric videos, and it effectively facilitates multiple downstream applications in egocentric vision and robotic manipulation. Code and relevant data will be released at https://github.com/IRMVLab/EgoLoc.
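The abstract outlines a three-stage zero-shot pipeline: hand-dynamics-guided sampling of visual prompts, VLM-based identification of contact/separation moments, and closed-loop refinement. The snippet below is a minimal, hypothetical sketch of that loop, assuming per-frame 2D hand centroids from any off-the-shelf hand detector and a user-supplied `vlm_query` callable wrapping a vision-language model; all names and parameters here are illustrative and do not reflect the released EgoLoc code.

```python
# Minimal sketch of a zero-shot TIL loop in the spirit of the abstract.
# Assumptions: hand_xy holds per-frame 2D hand centroids (any detector),
# vlm_query is a user-supplied callable mapping keyframe indices to
# (contact_time, separation_time). Names are hypothetical, not EgoLoc's API.

from typing import Callable, List, Sequence, Tuple
import numpy as np


def sample_by_hand_dynamics(
    hand_xy: Sequence[Tuple[float, float]],  # per-frame 2D hand centroid
    num_prompts: int = 8,
) -> List[int]:
    """Hand-dynamics-guided sampling: pick frames near local minima of hand
    speed, where contact or separation moments are most likely to occur."""
    xy = np.asarray(hand_xy, dtype=float)
    if len(xy) < 2:
        return [0]
    speed = np.linalg.norm(np.diff(xy, axis=0), axis=1)   # speed between frames
    speed = np.concatenate([[speed[0]], speed])           # pad to frame count
    slowest = np.argsort(speed)[:num_prompts]             # slowest frames first
    return sorted(int(i) for i in slowest)                # restore temporal order


def localize_interaction(
    hand_xy: Sequence[Tuple[float, float]],
    timestamps: Sequence[float],
    vlm_query: Callable[[List[int]], Tuple[float, float]],
    rounds: int = 3,
    margin: float = 0.5,
) -> Tuple[float, float]:
    """Sample visual prompts, ask the VLM for contact/separation timestamps,
    and use its answer as closed-loop feedback to shrink the search window."""
    n = len(timestamps)
    lo, hi = 0, n - 1
    contact_t = separation_t = float(timestamps[0])
    for _ in range(rounds):
        window = list(range(lo, hi + 1))
        local = sample_by_hand_dynamics([hand_xy[j] for j in window])
        prompts = [window[i] for i in local]               # absolute frame indices
        contact_t, separation_t = vlm_query(prompts)       # VLM grounds the moments
        # Feedback: narrow the window around the current estimates and repeat.
        lo = max(0, min(int(np.searchsorted(timestamps, contact_t - margin)), n - 1))
        hi = max(lo, min(int(np.searchsorted(timestamps, separation_t + margin)), n - 1))
    return contact_t, separation_t
```

Passing the VLM as a callable keeps the loop model-agnostic: any prompting scheme that maps a handful of keyframes to two timestamps can be plugged in without changing the sampling or feedback logic.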