Real-time monocular 3D object detection remains challenging due to severe depth ambiguity, viewpoint shifts, and the high computational cost of 3D reasoning. Existing approaches either rely on LiDAR or geometric priors to compensate for missing depth, or sacrifice efficiency to achieve competitive accuracy. We introduce LeAD-M3D, a monocular 3D detector that achieves state-of-the-art accuracy and real-time inference without extra modalities. Our method is powered by three key components. Asymmetric Augmentation Denoising Distillation (A2D2) transfers geometric knowledge from a clean-image teacher to a mixup-noised student via a quality- and importance-weighted depth-feature loss, enabling stronger depth reasoning without LiDAR supervision. 3D-aware Consistent Matching (CM3D) improves prediction-to-ground-truth assignment by integrating 3D MGIoU into the matching score, yielding more stable and precise supervision. Finally, Confidence-Gated 3D Inference (CGI3D) accelerates detection by restricting expensive 3D regression to top-confidence regions. Together, these components set a new Pareto frontier for monocular 3D detection: LeAD-M3D achieves state-of-the-art accuracy on KITTI and Waymo, and the best reported car AP on Rope3D, while running up to 3.6x faster than prior high-accuracy methods. Our results demonstrate that high accuracy and real-time efficiency in monocular 3D detection are simultaneously attainable, without LiDAR, stereo, or geometric assumptions.
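To make the confidence-gating idea behind CGI3D concrete, the following is a minimal PyTorch sketch of running a dense, cheap classification pass over the whole feature map and applying the 3D regression head only at the top-k most confident locations. The module names, feature shapes, the 7-parameter box encoding, and the top-k budget are illustrative assumptions, not the paper's exact architecture.

```python
# Minimal sketch of confidence-gated 3D inference (assumed design, not the paper's code):
# score every location cheaply, then run the expensive 3D regression on only k locations.
import torch
import torch.nn as nn


class GatedHead(nn.Module):
    def __init__(self, channels=256, num_classes=3, k=100):
        super().__init__()
        self.cls_head = nn.Conv2d(channels, num_classes, 1)  # cheap dense scoring
        self.reg_head = nn.Linear(channels, 7)                # 3D box: x, y, z, w, h, l, yaw (assumed encoding)
        self.k = k

    def forward(self, feat):                                  # feat: (B, C, H, W)
        B, C, H, W = feat.shape
        scores = self.cls_head(feat).sigmoid()                # (B, num_classes, H, W)
        conf, _ = scores.max(dim=1)                           # per-location confidence: (B, H, W)
        conf = conf.flatten(1)                                # (B, H*W)
        k = min(self.k, conf.shape[1])
        topk_conf, topk_idx = conf.topk(k, dim=1)             # gate: keep only top-k cells
        flat = feat.flatten(2).transpose(1, 2)                # (B, H*W, C)
        gathered = torch.gather(                              # features at gated locations: (B, k, C)
            flat, 1, topk_idx.unsqueeze(-1).expand(-1, -1, C))
        boxes3d = self.reg_head(gathered)                     # 3D regression on k cells only: (B, k, 7)
        return topk_conf, topk_idx, boxes3d


# Usage: dense scoring over the full map, 3D regression over only k locations.
head = GatedHead()
topk_conf, topk_idx, boxes3d = head(torch.randn(2, 256, 48, 160))
print(topk_conf.shape, topk_idx.shape, boxes3d.shape)
```

The savings come from the fact that the regression cost scales with k rather than with H*W; the actual gating criterion and budget in LeAD-M3D may differ from this sketch.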