Action detection is a complex task that aims to detect and classify human actions in video clips. Typically, it has been addressed by processing fine-grained features extracted from a video classification backbone. Recently, thanks to the robustness of object and people detectors, increasing attention has been devoted to relationship modelling. Following this line of research, we propose a graph-based framework to learn high-level interactions between people and objects, in both space and time. In our formulation, spatio-temporal relationships are learned through self-attention on a multi-layer graph structure which can connect entities from consecutive clips, thus capturing long-range spatial and temporal dependencies. The proposed module is backbone independent by design and does not require end-to-end training. Extensive experiments are conducted on the AVA dataset, where our model demonstrates state-of-the-art results and consistent improvements over baselines built with different backbones. Code is publicly available at https://github.com/aimagelab/STAGE_action_detection.
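To make the idea more concrete, below is a minimal, hypothetical sketch (not the authors' implementation) of masked self-attention over entity features pooled from consecutive clips. The class name `EntityGraphAttention`, its arguments, and the adjacency construction are illustrative assumptions layered on top of the abstract, intended only to show how a graph mask can restrict attention to connected entities.

```python
# Hypothetical sketch: masked self-attention over person/object entities
# pooled from consecutive clips. Names and shapes are illustrative only
# and do not reproduce the STAGE implementation.
import torch
import torch.nn as nn


class EntityGraphAttention(nn.Module):
    """Single self-attention layer over a graph of detected entities.

    Each entity (person or object box) is represented by a feature vector
    extracted from an off-the-shelf backbone; `adjacency` restricts attention
    to pairs of entities connected in the spatio-temporal graph
    (e.g. entities from the same clip or from temporally adjacent clips).
    """

    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, feats: torch.Tensor, adjacency: torch.Tensor) -> torch.Tensor:
        # feats:     (batch, num_entities, dim) pooled entity features
        # adjacency: (batch, num_entities, num_entities) boolean graph edges
        # nn.MultiheadAttention blocks positions where attn_mask is True,
        # so the complement of the adjacency is passed as the mask,
        # expanded to one copy per attention head.
        mask = ~adjacency
        mask = mask.repeat_interleave(self.attn.num_heads, dim=0)
        out, _ = self.attn(feats, feats, feats, attn_mask=mask)
        return self.norm(feats + out)  # residual connection


if __name__ == "__main__":
    # Toy example: 10 entities (people + objects) from two consecutive clips,
    # here with a fully connected graph for simplicity.
    feats = torch.randn(1, 10, 256)
    adjacency = torch.ones(1, 10, 10, dtype=torch.bool)
    layer = EntityGraphAttention(dim=256)
    print(layer(feats, adjacency).shape)  # torch.Size([1, 10, 256])
```

In such a sketch the backbone stays frozen and only the attention layer is trained on pooled entity features, which is consistent with the abstract's claim that the module is backbone independent and does not require end-to-end training.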