Video Salient Document Detection (VSDD) is an important task in practical computer vision, which aims to highlight visually salient document regions in video frames. Previous techniques for VSDD focus on learning features without considering the cooperation among and across appearance and motion cues, and thus fail to perform well in practical scenarios. Moreover, most previous techniques demand high computational resources, which limits the use of such systems in resource-constrained settings. To address these issues, we propose VS-Net, which captures multi-scale spatiotemporal information with the help of dilated depth-wise separable convolution and Approximation Rank Pooling. VS-Net extracts key features locally from each frame across embedding sub-spaces and forwards the features between adjacent and parallel nodes, enhancing model performance globally. Our model generates saliency maps by considering the background and foreground simultaneously, which makes it perform better in challenging scenarios. Extensive experiments conducted on the benchmark MIDV-500 dataset show that VS-Net outperforms state-of-the-art approaches in terms of both speed and robustness.
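For readers unfamiliar with the building block named above, the following is a minimal PyTorch sketch of a dilated depth-wise separable convolution used to gather multi-scale spatial context. The module name, channel sizes, and the choice of dilation rates (1, 2, 4) are illustrative assumptions for exposition, not VS-Net's actual implementation.

```python
import torch
import torch.nn as nn

class DilatedDepthwiseSeparableConv(nn.Module):
    """Illustrative sketch, not the paper's exact block.

    A depth-wise conv filters each channel independently; a point-wise
    (1x1) conv then mixes channels. Dilation enlarges the receptive
    field with no extra parameters, which is what allows multi-scale
    context to be captured cheaply.
    """

    def __init__(self, in_ch: int, out_ch: int, dilation: int = 2):
        super().__init__()
        # groups=in_ch makes the 3x3 conv depth-wise;
        # padding=dilation preserves the spatial resolution.
        self.depthwise = nn.Conv2d(
            in_ch, in_ch, kernel_size=3, padding=dilation,
            dilation=dilation, groups=in_ch, bias=False)
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.act(self.bn(self.pointwise(self.depthwise(x))))


if __name__ == "__main__":
    # Multi-scale context: run branches with several dilation rates on
    # the same frame features and concatenate them along channels.
    frame_feats = torch.randn(1, 64, 128, 128)  # (N, C, H, W)
    branches = [DilatedDepthwiseSeparableConv(64, 32, d) for d in (1, 2, 4)]
    multi_scale = torch.cat([b(frame_feats) for b in branches], dim=1)
    print(multi_scale.shape)  # torch.Size([1, 96, 128, 128])
```

Compared with a standard 3x3 convolution, the depth-wise separable factorization cuts parameters and FLOPs roughly by a factor of the channel count, which is consistent with the resource-constrained motivation stated in the abstract.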