A quantitative analysis of semantic information in deep representations of text and images

Deep neural networks are known to develop similar representations for semantically related data, even when the data belong to different domains, such as an image and its description, or the same text in different languages. We present a method for quantitatively investigating this phenomenon by measuring the relative information content of the representations of semantically related data and probing how it is encoded into multiple tokens of large language models (LLMs) and vision transformers. Looking first at how LLMs process pairs of translated sentences, we identify inner "semantic" layers containing the most language-transferable information. Moreover, we find that, on these layers, a larger LLM (DeepSeek-V3) extracts significantly more general information than a smaller one (Llama3.1-8B). Semantic information of English text is spread across many tokens and is characterized by long-distance correlations between tokens and by a causal left-to-right (i.e., past-future) asymmetry. We also identify layers encoding semantic information within vision transformers. We show that caption representations in the semantic layers of LLMs predict visual representations of the corresponding images. We observe significant and model-dependent information asymmetries between image and text representations.
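
The abstract does not say which estimator is used to measure the "relative information content" of two representations. As an illustration only, the sketch below assumes a neighbor-rank statistic known as the information imbalance, Delta(A -> B) = (2/N) * <rank in B of each point's nearest neighbor in A>. It is close to 0 when A determines the neighborhoods of B and close to 1 when A is uninformative about B, and the difference between Delta(A -> B) and Delta(B -> A) gives one way to express an information asymmetry. The function names `neighbor_ranks` and `information_imbalance` are hypothetical, and the random arrays stand in for layer activations; none of this is taken from the paper's code.

```python
import numpy as np


def neighbor_ranks(x):
    """Return an (N, N) integer matrix: entry [i, j] is the rank of point j
    among the neighbors of point i (0 = i itself, 1 = nearest neighbor)."""
    x = np.asarray(x, dtype=float)
    n = x.shape[0]
    # Pairwise squared Euclidean distances.
    sq = np.sum(x ** 2, axis=1)
    dist = sq[:, None] + sq[None, :] - 2.0 * (x @ x.T)
    np.fill_diagonal(dist, -np.inf)  # force each point to be its own rank 0
    order = np.argsort(dist, axis=1)  # order[i, k] = index of k-th neighbor of i
    ranks = np.empty_like(order)
    rows = np.arange(n)[:, None]
    ranks[rows, order] = np.arange(n)[None, :]  # invert the permutation row-wise
    return ranks


def information_imbalance(a, b):
    """Assumed estimator: (2 / N) times the mean rank, measured in space B,
    of each point's nearest neighbor in space A."""
    ranks_a = neighbor_ranks(a)
    ranks_b = neighbor_ranks(b)
    n = ranks_a.shape[0]
    nearest_in_a = np.argmax(ranks_a == 1, axis=1)  # nearest neighbor index in A
    rank_in_b = ranks_b[np.arange(n), nearest_in_a]
    return 2.0 * rank_in_b.mean() / n


# Toy stand-ins for two representations of the same 300 items:
# `full` has five features, `partial` keeps only the first one.
rng = np.random.default_rng(0)
full = rng.normal(size=(300, 5))
partial = full[:, :1]

print("Delta(full -> partial):", information_imbalance(full, partial))
print("Delta(partial -> full):", information_imbalance(partial, full))
```

In this toy setup the richer representation predicts the poorer one better than the reverse, so the first value should come out smaller than the second. With real data, the two inputs would be activations of paired items (for example a sentence and its translation, or a caption and its image) taken from chosen layers.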
