Recent foundation models for tabular data achieve strong task-specific performance via in-context learning. However, they focus on direct prediction, encapsulating both representation learning and task-specific inference inside a single, resource-intensive network. This work focuses specifically on representation learning, i.e., on transferable, task-agnostic embeddings. We systematically evaluate task-agnostic representations from tabular foundation models (TabPFN and TabICL) alongside classical feature engineering (TableVectorizer) across application tasks such as outlier detection (ADBench) and supervised learning (TabArena Lite). We find that simple TableVectorizer features achieve comparable or superior performance while being up to three orders of magnitude faster to compute than tabular foundation model embeddings. The code is available at https://github.com/ContactSoftwareAI/TabEmbedBench.
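To make the classical baseline concrete, the following is a minimal sketch (not the benchmark's exact pipeline) of how skrub's TableVectorizer produces task-agnostic numeric features from a heterogeneous dataframe, which any downstream estimator can then consume, here scikit-learn's IsolationForest as one representative outlier detector. The dataframe contents and column names are purely illustrative.

```python
import pandas as pd
from skrub import TableVectorizer
from sklearn.ensemble import IsolationForest

# Illustrative heterogeneous table (hypothetical data, not from the benchmark).
df = pd.DataFrame({
    "category": ["a", "b", "a", "c"],
    "amount": [10.0, 250.0, 12.5, 9.9],
    "note": ["ok", "refund", "ok", "ok"],
})

# Task-agnostic featurization: no labels are involved, so the same
# features can be reused across application tasks.
vectorizer = TableVectorizer()
X = vectorizer.fit_transform(df)

# Plug the features into a downstream task, e.g. outlier detection.
detector = IsolationForest(random_state=0).fit(X)
scores = -detector.score_samples(X)  # higher score = more anomalous
```

Because the featurization step is unsupervised and purely local, it avoids the forward passes through a large pretrained network that TabPFN- or TabICL-style embedding extraction requires, which is the source of the runtime gap reported above.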