We propose a principled method for autoencoding with random forests. Our strategy builds on foundational results from nonparametric statistics and spectral graph theory to learn a low-dimensional embedding of the model that optimally represents relationships in the data. We provide exact and approximate solutions to the decoding problem via constrained optimization, split relabeling, and nearest neighbors regression. These methods effectively invert the compression pipeline, establishing a map from the embedding space back to the input space using splits learned by the ensemble's constituent trees. The resulting decoders are universally consistent under common regularity assumptions. The procedure works with supervised or unsupervised models, providing a window into conditional or joint distributions. We demonstrate various applications of this autoencoder, including powerful new tools for visualization, compression, clustering, and denoising. Experiments illustrate the ease and utility of our method in a wide range of settings, including tabular, image, and genomic data.