Monocular height estimation plays a critical role in 3D perception for remote sensing, offering a cost-effective alternative to multi-view or LiDAR-based methods. Deep learning has significantly advanced monocular height estimation, but these methods remain fundamentally limited by the availability of labeled data, which is expensive and labor-intensive to obtain at scale. The scarcity of high-quality annotations hinders the generalization and performance of existing models. To overcome this limitation, we propose leveraging large volumes of unlabeled data through a semi-supervised learning framework, so that the model can extract informative cues from unlabeled samples and improve its predictive performance. We introduce TSE-Net, a self-training pipeline for semi-supervised monocular height estimation that integrates teacher, student, and exam networks. The student network is trained on unlabeled data using pseudo-labels generated by the teacher network, while the exam network acts as a temporal ensemble of the student network to stabilize performance. The teacher network is formulated as a joint regression and classification model: the regression branch predicts height values that serve as pseudo-labels, and the classification branch predicts height value classes along with class probabilities, which are used to filter the pseudo-labels. Height value classes are defined using a hierarchical bi-cut strategy to address the inherent long-tailed distribution of heights, and the predicted class probabilities are calibrated with a Plackett-Luce model so that they reflect the expected accuracy of the pseudo-labels. We evaluate the proposed pipeline on three datasets spanning different resolutions and imaging modalities. Code is available at https://github.com/zhu-xlab/tse-net.
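To make the interaction between the three networks concrete, the following is a minimal sketch and not the released TSE-Net implementation. It assumes that the temporal ensemble takes the form of an exponential moving average (EMA) of student weights, that pseudo-labels are filtered by thresholding the teacher's top class probability, and that the student is trained with an L1 loss on the retained pixels. The function names (`ema_update`, `confidence_mask`, `masked_l1`), the decay, and the threshold are hypothetical and chosen only for illustration; the hierarchical bi-cut class definition and the Plackett-Luce calibration are not modeled here.

```python
from typing import List, Sequence


def ema_update(exam: Sequence[float], student: Sequence[float],
               decay: float = 0.99) -> List[float]:
    """Blend student weights into the exam weights.

    Assumption: the temporal ensemble is an exponential moving average.
    """
    return [decay * e + (1.0 - decay) * s for e, s in zip(exam, student)]


def confidence_mask(class_probs: Sequence[Sequence[float]],
                    threshold: float = 0.7) -> List[bool]:
    """Keep a pixel's pseudo-label only if the teacher's top class
    probability reaches the (hypothetical) threshold."""
    return [max(p) >= threshold for p in class_probs]


def masked_l1(student_pred: Sequence[float],
              pseudo_heights: Sequence[float],
              mask: Sequence[bool]) -> float:
    """Mean absolute error over the pixels whose pseudo-labels were kept."""
    kept = [abs(s - t)
            for s, t, m in zip(student_pred, pseudo_heights, mask) if m]
    return sum(kept) / len(kept) if kept else 0.0


# Toy example with three "pixels".
teacher_heights = [2.0, 0.0, 3.0]                      # regression branch output
teacher_probs = [[0.9, 0.1], [0.5, 0.5], [0.2, 0.8]]   # classification branch output
student_heights = [1.0, 5.0, 3.0]

mask = confidence_mask(teacher_probs, threshold=0.7)
loss = masked_l1(student_heights, teacher_heights, mask)
exam_weights = ema_update([1.0], [0.0], decay=0.9)
```

In this toy case the second pixel is discarded because its top class probability falls below the threshold, so its large regression error does not contribute to the student's loss.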