Information theory is an outstanding framework for measuring uncertainty, dependence, and relevance in data and systems. It has several desirable properties for real-world applications: it naturally deals with multivariate data, it can handle heterogeneous data types, and its measures can be interpreted in physical units. However, it has not been adopted by a wider audience because obtaining information from multidimensional data is a challenging problem due to the curse of dimensionality. Here we propose an indirect way of computing information based on a multivariate Gaussianization transform. Our proposal mitigates the difficulty of multivariate density estimation by reducing it to a composition of tractable (marginal) operations and simple linear transformations, which can be interpreted as a particular deep neural network. We introduce specific Gaussianization-based methodologies to estimate total correlation, entropy, mutual information, and Kullback-Leibler divergence. We compare them with recent estimators, showing their accuracy on synthetic data generated from different multivariate distributions. We make the tools and datasets publicly available to provide a test-bed for analyzing future methodologies. Results show that our proposal is superior to previous estimators, particularly in high-dimensional scenarios, and that it leads to interesting insights in neuroscience, geoscience, computer vision, and machine learning.