Feature generation can significantly improve learning outcomes, particularly for tasks with limited data. An effective way to improve feature generation is to expand the current feature space using existing features, thereby enriching its informational content. However, generating new, interpretable features usually requires domain-specific knowledge in addition to the existing features. In this paper, we introduce RAFG, a Retrieval-Augmented Feature Generation method that generates useful and explainable features for domain-specific classification tasks. To increase the interpretability of the generated features, we perform knowledge retrieval over the existing features in the domain to identify potential feature associations, which are expected to help generate useful features. Moreover, we develop a framework based on large language models (LLMs) that generates features with reasoning and verifies their quality during the generation process. Experiments on several datasets from the medical, economic, and geographic domains show that RAFG produces high-quality, meaningful features and significantly improves classification performance over baseline methods.