We propose Track and Caption Any Motion (TCAM), a motion-centric framework for automatic video understanding that discovers and describes motion patterns without user queries. Understanding videos under challenging conditions such as occlusion, camouflage, or rapid movement often depends more on motion dynamics than on static appearance. TCAM autonomously observes a video, identifies multiple motion activities, and spatially grounds each natural language description to its corresponding trajectory through a motion-field attention mechanism. Our key insight is that motion patterns, when aligned with contrastive vision-language representations, provide powerful semantic signals for recognizing and describing actions. Through unified training that combines global video-text alignment with fine-grained spatial correspondence, TCAM enables query-free discovery of multiple motion expressions via multi-head cross-attention. On the MeViS benchmark, TCAM achieves 58.4% video-to-text retrieval accuracy and 64.9 J&F for spatial grounding, and discovers an average of 4.8 relevant expressions per video with 84.7% precision, demonstrating strong cross-task generalization.
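To make the query-free discovery and alignment ideas above concrete, the following is a minimal illustrative sketch, not the authors' implementation: module names, dimensions, the learned-query scheme, and the CLIP-style contrastive loss are all assumptions introduced here for exposition. It shows a set of learned motion queries cross-attending to motion-field features, with the pooled motion embeddings aligned to text embeddings by a contrastive objective.

# Illustrative sketch only: names, shapes, and the query-based discovery
# scheme are assumptions for exposition, not the paper's released code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MotionExpressionDiscovery(nn.Module):
    """Learned motion queries cross-attend to motion-field features,
    yielding per-query embeddings that can be matched to text."""
    def __init__(self, dim=256, num_queries=8, num_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim))
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.proj = nn.Linear(dim, dim)

    def forward(self, motion_feats):
        # motion_feats: (B, T*H*W, dim) flattened spatio-temporal motion-field features
        B = motion_feats.size(0)
        q = self.queries.unsqueeze(0).expand(B, -1, -1)            # (B, Q, dim)
        attended, attn = self.cross_attn(q, motion_feats, motion_feats)
        return self.proj(attended), attn                           # per-query motion embeddings, attention maps

def contrastive_alignment(motion_emb, text_emb, temperature=0.07):
    """Global video-text alignment: cosine-similarity logits between pooled
    motion embeddings and text embeddings, trained with a CLIP-style loss."""
    m = F.normalize(motion_emb.mean(dim=1), dim=-1)                # (B, dim)
    t = F.normalize(text_emb, dim=-1)                              # (B, dim)
    logits = m @ t.T / temperature
    labels = torch.arange(m.size(0), device=m.device)
    return F.cross_entropy(logits, labels)

In such a setup, the per-query attention maps over the motion field would provide the spatial grounding signal, while the contrastive term handles global video-text alignment; how the paper combines these two objectives in its unified training is specified in the method section, not here.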