In this paper, we explore learning end-to-end deep neural trackers without tracking annotations. This matters because large-scale training data is essential for training deep neural trackers, yet tracking annotations are expensive to acquire. In place of tracking annotations, we first hallucinate videos from images with bounding box annotations, using zoom-in/out motion transformations to obtain tracking labels for free. We add video-simulation augmentations to create a diverse tracking dataset, albeit one with simple motion. Next, to tackle harder tracking cases, we mine hard examples from an unlabeled pool of real videos using a tracker trained on our hallucinated video data. For hard example mining, we propose an optimization-based connecting process that first identifies and then rectifies hard examples in the pool of unlabeled videos. Finally, we train our tracker jointly on the hallucinated data and the mined hard video examples. Our weakly supervised tracker achieves state-of-the-art performance on the MOT17 and TAO-person datasets. On MOT17, we further demonstrate that combining our self-generated data with the existing manually annotated data yields additional improvements.
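The zoom-in/out hallucination can be pictured with a short sketch. The following is a minimal illustration under our own assumptions, not the authors' implementation: the function names (`zoom_crop`, `hallucinate_clip`), the nearest-neighbor resize, and the fixed zoom schedule are all hypothetical simplifications, and the paper's pipeline additionally applies video-simulation augmentations that are not shown here. The key idea it demonstrates is that a single annotated image yields a multi-frame clip in which every box keeps its identity, so track labels come for free.

```python
import numpy as np

def zoom_crop(image, boxes, scale):
    """Center-crop a (H*scale, W*scale) window and resize it back to (H, W)
    with nearest-neighbor sampling; boxes ([x1, y1, x2, y2] rows) are mapped
    into the crop's coordinate frame and rescaled to full resolution."""
    h, w = image.shape[:2]
    ch, cw = int(h * scale), int(w * scale)
    y0, x0 = (h - ch) // 2, (w - cw) // 2
    crop = image[y0:y0 + ch, x0:x0 + cw]
    # Nearest-neighbor resize back to (h, w) via index arrays.
    yy = (np.arange(h) * ch / h).astype(int)
    xx = (np.arange(w) * cw / w).astype(int)
    frame = crop[yy][:, xx]
    # Shift boxes into the crop frame, rescale, and clip to image bounds.
    b = boxes.astype(float)
    b[:, [0, 2]] = np.clip((b[:, [0, 2]] - x0) * (w / cw), 0, w)
    b[:, [1, 3]] = np.clip((b[:, [1, 3]] - y0) * (h / ch), 0, h)
    return frame, b

def hallucinate_clip(image, boxes, n_frames=8, max_zoom=0.7):
    """Turn one annotated image into a short 'video': each frame applies a
    progressively stronger zoom-in, and each box's track id is simply its
    row index, constant across frames, i.e. a free tracking label."""
    clip = []
    for t in range(n_frames):
        s = 1.0 - (1.0 - max_zoom) * t / max(n_frames - 1, 1)
        frame, b = zoom_crop(image, boxes, s)
        labels = [(track_id, box) for track_id, box in enumerate(b)]
        clip.append((frame, labels))
    return clip

if __name__ == "__main__":
    # Toy usage: one 480x640 image with a single annotated box.
    img = np.zeros((480, 640, 3), dtype=np.uint8)
    boxes = np.array([[100.0, 120.0, 220.0, 300.0]])
    clip = hallucinate_clip(img, boxes)
    print(len(clip), clip[-1][1])  # 8 frames; box grows as the zoom tightens
```

Reversing the schedule (scales increasing past 1.0 with padding) would give the zoom-out counterpart; the hard-example mining stage is omitted, since its optimization-based connecting process depends on details beyond this summary.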