Panoramas have a full FoV (360$^\circ\times$180$^\circ$), offering a more complete visual description than perspective images. Thanks to this characteristic, panoramic depth estimation is gaining increasing traction in 3D vision. However, due to the scarcity of panoramic data, previous methods are often restricted to in-domain settings, leading to poor zero-shot generalization. Furthermore, due to the spherical distortions inherent in panoramas, many approaches rely on perspective splitting (e.g., cubemaps), which leads to suboptimal efficiency. To address these challenges, we propose $\textbf{DA}$$^{\textbf{2}}$: $\textbf{D}$epth $\textbf{A}$nything in $\textbf{A}$ny $\textbf{D}$irection, an accurate, zero-shot generalizable, and fully end-to-end panoramic depth estimator. Specifically, to scale up panoramic data, we introduce a data curation engine that generates high-quality panoramic depth data from perspective images, and create $\sim$543K panoramic RGB-depth pairs, bringing the total to $\sim$607K. To further mitigate the spherical distortions, we present SphereViT, which explicitly leverages spherical coordinates to enforce spherical geometric consistency in panoramic image features, yielding improved performance. A comprehensive benchmark on multiple datasets clearly demonstrates DA$^{2}$'s SoTA performance, with an average 38% improvement in AbsRel over the strongest zero-shot baseline. Surprisingly, DA$^{2}$ even outperforms prior in-domain methods, highlighting its superior zero-shot generalization. Moreover, as an end-to-end solution, DA$^{2}$ exhibits much higher efficiency than fusion-based approaches. Both the code and the curated panoramic data have been released. Project page: https://depth-any-in-any-dir.github.io/.
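SphereViT conditions panoramic image features on the spherical coordinates of each pixel. The exact embedding is described in the paper; as a minimal sketch of the underlying idea, the code below maps equirectangular pixel positions to (longitude, latitude) angles covering the full 360$^\circ\times$180$^\circ$ FoV. The function name and half-pixel sampling convention are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def spherical_coords(height, width):
    """Map each equirectangular pixel center to (longitude, latitude) in radians.

    Longitude spans [-pi, pi) across the width (360 degrees) and latitude
    spans [pi/2, -pi/2] from top to bottom (180 degrees). Pixel centers
    are sampled at half-integer offsets.
    """
    u = (np.arange(width) + 0.5) / width    # normalized column in [0, 1)
    v = (np.arange(height) + 0.5) / height  # normalized row in [0, 1)
    lon = (u - 0.5) * 2.0 * np.pi           # longitude in [-pi, pi)
    lat = (0.5 - v) * np.pi                 # latitude in [pi/2, -pi/2]
    # meshgrid broadcasts both to shape (height, width), one angle pair per pixel
    return np.meshgrid(lon, lat)

lon, lat = spherical_coords(512, 1024)  # e.g. a 1024x512 equirectangular panorama
```

Such per-pixel angles can then be encoded (e.g., via sinusoidal embeddings) and injected into the ViT features so attention is aware of each token's position on the sphere rather than on a distorted 2D grid.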