Embedding spaces are fundamental to modern AI, translating raw data into high-dimensional vectors that encode rich semantic relationships. Yet their internal structure remains opaque: existing approaches often either sacrifice semantic coherence for structural regularity or incur high computational overhead to improve interpretability. To address these challenges, we introduce the Semantic Field Subspace (SFS), a geometry-preserving, context-aware representation that captures local semantic neighborhoods within the embedding space. We also propose SAFARI (SemAntic Field subspAce deteRmInation), an unsupervised, modality-agnostic algorithm that uncovers hierarchical semantic structures using a novel metric called Semantic Shift, which quantifies how semantics change as an SFS evolves. To ensure scalability, we develop an efficient approximation of Semantic Shift that replaces costly SVD computations, achieving a 15x to 30x speedup with an average error below 0.01. Extensive evaluations across six real-world text and image datasets show that SFSes outperform standard classifiers not only in classification but also in nuanced tasks such as political bias detection, while SAFARI consistently reveals interpretable and generalizable semantic hierarchies. This work presents a unified framework for structuring, analyzing, and scaling semantic understanding in embedding spaces.
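To make the idea of a local subspace and a shift between subspaces concrete, here is a minimal sketch. It is an illustration under stated assumptions, not the method of this work: the abstract does not define how an SFS is constructed or how Semantic Shift is computed, so the sketch fits a subspace to the k nearest neighbors of a point with a plain SVD and measures the change between two subspaces with a principal-angle statistic. The function names `local_subspace` and `subspace_shift`, the parameters `k` and `r`, and the choice of statistic are all hypothetical. The sketch uses the exact SVD route and does not reproduce the efficient approximation described above.

```python
import numpy as np


def local_subspace(X, center, k, r):
    """Return an orthonormal basis (d x r) fitted to a local neighborhood.

    Hypothetical helper: takes the k rows of X most cosine-similar to row
    `center`, mean-centers them, and keeps the top-r right singular vectors.
    Assumes no row of X is all zeros and that r <= min(k, d).
    """
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    sims = Xn @ Xn[center]                  # cosine similarity to the center row
    nbrs = np.argsort(-sims)[:k]            # indices of the k most similar rows
    local = X[nbrs] - X[nbrs].mean(axis=0)  # center the neighborhood
    _, _, vt = np.linalg.svd(local, full_matrices=False)
    return vt[:r].T                         # columns are orthonormal


def subspace_shift(U, V):
    """Stand-in shift measure (an assumption, not the paper's Semantic Shift).

    Computes 1 minus the mean squared cosine of the principal angles between
    the column spaces of U and V. For orthonormal inputs the value is 0 when
    the subspaces coincide and 1 when they are mutually orthogonal.
    """
    r = min(U.shape[1], V.shape[1])
    return 1.0 - np.linalg.norm(U.T @ V, "fro") ** 2 / r


# Usage: compare the subspace around one point as its neighborhood grows.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 16))              # synthetic stand-in for embeddings
small = local_subspace(X, center=0, k=10, r=3)
large = local_subspace(X, center=0, k=40, r=3)
print(f"shift between k=10 and k=40 neighborhoods: {subspace_shift(small, large):.3f}")
```

A value near 0 would suggest the neighborhood's dominant directions are stable as it grows, while a larger value would suggest the enlarged neighborhood spans different directions. How such a signal relates to the hierarchy discovery performed by SAFARI is not specified in the abstract.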