Unsupervised learning of depth and ego-motion from unlabelled monocular videos has recently drawn great attention, as it avoids the expensive ground truth required by supervised methods. It achieves this by using as the loss the photometric errors between the target view and views synthesized from its adjacent source views. Despite significant progress, the learning still suffers from occlusion and scene dynamics. This paper shows that carefully manipulating photometric errors can better tackle these difficulties. The primary improvement is achieved by a statistical technique that masks out invisible or non-stationary pixels in the photometric error map, thus preventing them from misleading the networks. With this outlier-masking approach, the depth of objects moving in the opposite direction to the camera can be estimated more accurately. To the best of our knowledge, such scenarios have not been seriously considered in previous works, even though they pose a higher risk in applications such as autonomous driving. We also propose an efficient weighted multi-scale scheme to reduce artifacts in the predicted depth maps. Extensive experiments on the KITTI dataset demonstrate the effectiveness of the proposed approaches. The overall system achieves state-of-the-art performance on both depth and ego-motion estimation.
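The abstract does not specify the exact form of the statistical outlier masking. As a minimal sketch only, one plausible scheme is to treat pixels whose photometric error exceeds a mean-plus-standard-deviation threshold as occluded or non-stationary and drop them from the loss; the function names, the NumPy error map, and the `num_std` parameter below are illustrative assumptions, not the paper's actual method.

```python
import numpy as np

def outlier_mask(photometric_error, num_std=2.0):
    """Hypothetical statistical mask over a per-pixel photometric error map.

    photometric_error: (H, W) array of errors between the target view and a
    view synthesized from an adjacent source view. Pixels whose error exceeds
    mean + num_std * std are assumed occluded or non-stationary and are
    excluded so they cannot mislead the depth/pose networks.
    """
    mu = photometric_error.mean()
    sigma = photometric_error.std()
    return photometric_error <= mu + num_std * sigma

def masked_photometric_loss(photometric_error, mask):
    # Average the error over inlier pixels only.
    return photometric_error[mask].mean()

# Toy example: a mostly uniform error map with one large outlier,
# e.g. a pixel occluded in the source view.
err = np.full((4, 4), 0.1)
err[0, 0] = 10.0
mask = outlier_mask(err)
loss = masked_photometric_loss(err, mask)
```

In this toy case the single high-error pixel is masked out, so the loss reflects only the well-explained pixels rather than being dominated by the occlusion.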