Recent advances in 3D point cloud transformers have led to state-of-the-art results in tasks such as semantic segmentation and reconstruction. However, these models typically rely on dense token representations, incurring high computational and memory costs during training and inference. In this work, we show that these tokens are remarkably redundant, leading to substantial inefficiency. We introduce gitmerge3D, a globally informed graph token merging method that reduces the token count by up to 90-95% while maintaining competitive performance. This finding challenges the prevailing assumption that more tokens inherently yield better performance, and it highlights that many current models are over-tokenized and under-optimized for scalability. We validate our method across multiple 3D vision tasks and show consistent improvements in computational efficiency. This work is the first to assess redundancy in large-scale 3D transformer models, providing insights into the development of more efficient 3D foundation architectures. Our code and checkpoints are publicly available at https://gitmerge3d.github.io