Bridging the View Disparity of Radar and Camera Features for Multi-modal Fusion 3D Object Detection

Environmental perception based on multi-modal fusion of radar and camera is crucial in autonomous driving, where it increases accuracy, completeness, and robustness. This paper focuses on how to use millimeter-wave (MMW) radar and camera sensor fusion for 3D object detection. It proposes a novel method that performs feature-level fusion in bird's-eye view (BEV) to obtain a better feature representation. First, radar features are augmented by temporal accumulation and passed to a temporal-spatial encoder for radar feature extraction. Meanwhile, multi-scale 2D image features that adapt to various spatial scales are obtained with an image backbone and neck model. The image features are then transformed to BEV by the designed view transformer. Next, the multi-modal features are fused by a two-stage fusion model whose stages are called point fusion and ROI fusion. Finally, a detection head predicts object categories and regresses 3D locations. Experimental results demonstrate that the proposed method achieves state-of-the-art performance on the challenging nuScenes dataset under the most important detection metrics, mean average precision (mAP) and nuScenes detection score (NDS).
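
The abstract does not give implementation details, so the following is only a minimal sketch of two ideas it names: accumulating radar sweeps over time and combining radar and image features on a shared BEV grid. Everything here is an assumption for illustration: the function names, the 2D ego pose format `(x, y, yaw)`, the ego-motion compensation step, the point-count rasterization, and the plain channel concatenation, which is a stand-in and not the paper's point fusion or ROI fusion.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]        # (x, y) in metres, in an ego frame
Pose = Tuple[float, float, float]  # hypothetical ego pose (x, y, yaw) in a world frame


def to_current_frame(point: Point, src: Pose, dst: Pose) -> Point:
    """Move a point from the ego frame at pose `src` to the ego frame at pose `dst`."""
    px, py = point
    sx, sy, syaw = src
    # Ego frame at `src` -> world frame.
    wx = sx + px * math.cos(syaw) - py * math.sin(syaw)
    wy = sy + px * math.sin(syaw) + py * math.cos(syaw)
    # World frame -> ego frame at `dst` (inverse rotation).
    dx, dy, dyaw = dst
    rx, ry = wx - dx, wy - dy
    return (rx * math.cos(dyaw) + ry * math.sin(dyaw),
            -rx * math.sin(dyaw) + ry * math.cos(dyaw))


def accumulate_sweeps(sweeps: List[List[Point]], poses: List[Pose]) -> List[Point]:
    """Merge several radar sweeps into the frame of the most recent pose."""
    current = poses[-1]
    merged: List[Point] = []
    for points, pose in zip(sweeps, poses):
        merged.extend(to_current_frame(p, pose, current) for p in points)
    return merged


def rasterize_bev(points: List[Point], grid_size: int, cell: float,
                  x_min: float, y_min: float) -> List[List[float]]:
    """Count points per BEV cell; grid[row][col] with rows along y, columns along x."""
    grid = [[0.0] * grid_size for _ in range(grid_size)]
    for x, y in points:
        col = math.floor((x - x_min) / cell)
        row = math.floor((y - y_min) / cell)
        if 0 <= row < grid_size and 0 <= col < grid_size:
            grid[row][col] += 1.0
    return grid


def concat_bev(radar_bev: List[List[float]],
               image_bev: List[List[List[float]]]) -> List[List[List[float]]]:
    """Per-cell channel concatenation: image channels followed by the radar count."""
    return [[img_cell + [radar_cell] for img_cell, radar_cell in zip(img_row, radar_row)]
            for img_row, radar_row in zip(image_bev, radar_bev)]


# Two sweeps observe the same static target while the ego vehicle moves 1 m forward.
sweeps = [[(5.0, 0.0)], [(4.0, 0.0)]]
poses = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
radar_bev = rasterize_bev(accumulate_sweeps(sweeps, poses),
                          grid_size=4, cell=2.0, x_min=0.0, y_min=-4.0)
image_bev = [[[0.1, 0.2] for _ in range(4)] for _ in range(4)]  # placeholder image features
fused = concat_bev(radar_bev, image_bev)
```

In this toy example both sweeps land in the same BEV cell after compensation, which illustrates why accumulation can densify sparse radar returns. A learned encoder and fusion module, as described in the abstract, would replace the counting and concatenation used here.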
