Concept Bottleneck Models (CBMs) have been proposed as a compromise between white-box and black-box models, aiming to achieve interpretability without sacrificing accuracy. The standard training procedure for CBMs is to predefine a candidate set of human-interpretable concepts, extract their values from the training data, and identify a sparse subset as inputs to a transparent prediction model. However, such approaches must balance exploring a sufficiently large set of candidate concepts against the cost of obtaining concept extractions, often resulting in a substantial interpretability-accuracy tradeoff. This work investigates a novel approach that sidesteps these challenges: BC-LLM iteratively searches over a potentially infinite set of concepts within a Bayesian framework, in which Large Language Models (LLMs) serve as both a concept extraction mechanism and a prior. Even though LLMs can be miscalibrated and prone to hallucination, we prove that BC-LLM can provide rigorous statistical inference and uncertainty quantification. Across image, text, and tabular datasets, BC-LLM outperforms interpretable baselines and even black-box models in certain settings, converges more rapidly towards relevant concepts, and is more robust to out-of-distribution samples.
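To make the iterative search concrete, below is a minimal runnable sketch of one plausible reading of this loop: a Metropolis-within-Gibbs-style sampler over fixed-size concept subsets, in which the LLM supplies both the proposals for new concepts and the per-example concept extractions. The helpers `llm_propose_concept`, `llm_extract`, and `log_evidence` are illustrative stand-ins (backed here by synthetic annotations so the script actually runs), not the paper's implementation, and the symmetric-proposal acceptance rule is a simplifying assumption.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

rng = np.random.default_rng(0)

# --- Toy stand-ins for the LLM calls (assumptions, not the paper's API). ---
# In BC-LLM the LLM proposes candidate concepts and annotates each example;
# here a "concept" is just an index into synthetic binary annotations.
N_EXAMPLES, N_CANDIDATES = 200, 30
TRUE_CONCEPTS = [0, 1, 2]
ANNOTATIONS = rng.integers(0, 2, size=(N_EXAMPLES, N_CANDIDATES)).astype(float)
y = (ANNOTATIONS[:, TRUE_CONCEPTS].sum(axis=1) >= 2).astype(int)

def llm_propose_concept(current_concepts):
    """Stand-in for prompting an LLM (the 'prior') for a fresh candidate concept."""
    return int(rng.integers(N_CANDIDATES))

def llm_extract(concept):
    """Stand-in for LLM-based per-example extraction of a concept's value."""
    return ANNOTATIONS[:, concept]

def log_evidence(Z, y):
    """Crude evidence proxy: BIC-penalized log-likelihood of a transparent
    model (logistic regression) fit on the extracted concept features."""
    model = LogisticRegression().fit(Z, y)
    total_ll = -len(y) * log_loss(y, model.predict_proba(Z))
    return total_ll - 0.5 * Z.shape[1] * np.log(len(y))

def bc_llm_sketch(K=3, n_iters=300):
    """Metropolis-within-Gibbs-style search over K-concept subsets."""
    concepts = [llm_propose_concept([]) for _ in range(K)]
    Z = np.column_stack([llm_extract(c) for c in concepts])
    log_ev = log_evidence(Z, y)
    for _ in range(n_iters):
        k = int(rng.integers(K))                 # pick one slot to resample
        cand = llm_propose_concept(concepts)     # LLM acts as the proposal
        Z_cand = Z.copy()
        Z_cand[:, k] = llm_extract(cand)         # LLM acts as the extractor
        log_ev_cand = log_evidence(Z_cand, y)
        # Symmetric-proposal Metropolis accept/reject on the model evidence.
        if np.log(rng.uniform()) < log_ev_cand - log_ev:
            concepts[k], Z, log_ev = cand, Z_cand, log_ev_cand
    return sorted(concepts)

print(bc_llm_sketch())
```

On this synthetic data the sampler tends to concentrate on the annotation columns that actually drive the label, loosely mirroring the claim above that the search converges towards relevant concepts; a faithful implementation would replace the stand-ins with real LLM prompts and a proper marginal likelihood.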