Today, people can easily record memorable moments, ranging from concerts, sports events, and lectures to family gatherings and birthday parties, with multiple consumer cameras. However, synchronizing these cross-camera streams remains challenging. Existing methods assume controlled settings, specific targets, manual correction, or costly hardware. We present VisualSync, an optimization framework based on multi-view dynamics that aligns unposed, unsynchronized videos with millisecond accuracy. Our key insight is that any moving 3D point, when co-visible in two cameras, obeys epipolar constraints once the streams are properly synchronized. To exploit this, VisualSync leverages off-the-shelf 3D reconstruction, feature matching, and dense tracking to extract tracklets, relative poses, and cross-view correspondences. It then jointly minimizes the epipolar error to estimate each camera's time offset. Experiments on four diverse, challenging datasets show that VisualSync outperforms baseline methods, achieving a median synchronization error below 50 ms.
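The core constraint can be illustrated with a minimal sketch: for normalized image points, a moving 3D point seen in two cameras satisfies x2ᵀ F x1 = 0 only when the two tracklets are temporally aligned, so scanning candidate offsets for the one with minimal epipolar residual recovers the synchronization. This is an illustrative toy version under simplifying assumptions (known fundamental matrix, pre-matched tracklets, integer frame offsets, brute-force search in place of the paper's joint optimization); all names and data below are hypothetical.

```python
import numpy as np

def epipolar_error(F, x1, x2):
    """Mean absolute epipolar residual |x2^T F x1| over matched 2D points."""
    x1h = np.hstack([x1, np.ones((len(x1), 1))])  # to homogeneous coords
    x2h = np.hstack([x2, np.ones((len(x2), 1))])
    return np.mean(np.abs(np.einsum('ni,ij,nj->n', x2h, F, x1h)))

def estimate_offset(F, track1, track2, max_shift):
    """Brute-force the integer frame offset d (track1[t] <-> track2[t+d])
    that minimizes the epipolar error; assumes max_shift < len(track)."""
    def aligned(d):
        if d >= 0:
            return track1[:len(track1) - d], track2[d:]
        return track1[-d:], track2[:len(track2) + d]
    return min(range(-max_shift, max_shift + 1),
               key=lambda d: epipolar_error(F, *aligned(d)))

if __name__ == "__main__":
    # Synthetic example: one moving 3D point, two calibrated cameras
    # (P1 = [I|0], P2 = [R|t], so F reduces to the essential matrix [t]x R),
    # with the second video delayed by 3 frames.
    true_d, N = 3, 60
    p = lambda t: np.array([np.sin(0.3 * t), np.cos(0.2 * t), 5 + 0.05 * t])
    R, tvec = np.eye(3), np.array([1.0, 0.0, 0.0])
    track1 = np.array([(p(t)[:2] / p(t)[2]) for t in range(N)])
    track2 = np.array([((R @ p(s - true_d) + tvec)[:2]
                        / (R @ p(s - true_d) + tvec)[2]) for s in range(N)])
    tx = np.array([[0, 0, 0], [0, 0, -1.0], [0, 1, 0]])  # [t]x for t=(1,0,0)
    print(estimate_offset(tx @ R, track1, track2, max_shift=10))  # 3
```

The real system must additionally estimate the relative poses, handle sub-frame offsets, and aggregate many noisy correspondences, which is why a joint optimization over all pairs replaces this exhaustive scan.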