Sound and Visual Representation Learning with Multiple Pretraining Tasks

Different self-supervised learning (SSL) tasks reveal different features of the data, and the learned feature representations can perform differently on each downstream task. In this light, this work aims to combine multiple SSL tasks (Multi-SSL) into a representation that generalizes well to all downstream tasks. Specifically, we investigate binaural sounds and image data in isolation. For binaural sounds, we propose three SSL tasks: spatial alignment, temporal synchronization of foreground objects and binaural audio, and temporal gap prediction. We investigate several approaches to Multi-SSL and give insights into downstream performance on video retrieval, spatial sound super-resolution, and semantic prediction on the OmniAudio dataset. Our experiments on binaural sound representations demonstrate that Multi-SSL via incremental learning (IL) of SSL tasks outperforms single-SSL-task models and fully supervised models on the downstream tasks. To check applicability to another modality, we also formulate our Multi-SSL models for image representation learning, using the recently proposed SSL tasks MoCov2 and DenseCL. Here, Multi-SSL surpasses the recent methods MoCov2, DenseCL, and DetCo by 2.06%, 3.27%, and 1.19% on VOC07 classification and by +2.83, +1.56, and +1.61 AP on COCO detection. Code will be made publicly available.
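The abstract does not say how incremental learning of SSL tasks is implemented, so the following is only a toy sketch of the general idea: a shared set of encoder weights is trained on one pretext task at a time, in order. Everything here is an assumption made for illustration. The quadratic losses stand in for real SSL objectives, and the names `quadratic_task` and `train_incrementally` are hypothetical, not taken from the paper.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical stand-in for an SSL task: maps the shared encoder weights
# to a (loss, gradient) pair. Real pretext tasks would be neural losses.
Task = Callable[[List[float]], Tuple[float, List[float]]]


def quadratic_task(target: List[float]) -> Task:
    """Toy pretext task whose loss is the squared distance to `target`."""
    def task(w: List[float]) -> Tuple[float, List[float]]:
        diff = [wi - ti for wi, ti in zip(w, target)]
        loss = sum(d * d for d in diff)
        grad = [2.0 * d for d in diff]
        return loss, grad
    return task


def train_incrementally(
    tasks: List[Tuple[str, Task]],
    w: List[float],
    steps: int = 50,
    lr: float = 0.1,
) -> Tuple[List[float], Dict[str, float]]:
    """Run gradient descent on each task in turn, reusing the same weights.

    Returns the final weights and the loss of each task measured right
    after its own training phase.
    """
    loss_after_phase: Dict[str, float] = {}
    for name, task in tasks:
        for _ in range(steps):
            _, grad = task(w)
            w = [wi - lr * gi for wi, gi in zip(w, grad)]
        loss_after_phase[name] = task(w)[0]
    return w, loss_after_phase


# Task names follow the abstract; the targets are arbitrary toy values.
tasks = [
    ("spatial_alignment", quadratic_task([1.0, 0.0])),
    ("temporal_synchronization", quadratic_task([0.0, 1.0])),
    ("temporal_gap_prediction", quadratic_task([1.0, 1.0])),
]
weights, losses = train_incrementally(tasks, [0.0, 0.0])
```

In this naive form the weights simply drift to the optimum of the most recent task, so earlier tasks are forgotten. A practical incremental scheme would need some mechanism to retain earlier knowledge; which mechanism the authors use is not stated in the abstract.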
