Model selection is a key task in statistics, playing a critical role across various scientific disciplines. While no model can fully capture the complexities of a real-world data-generating process, identifying the model that best approximates it can provide valuable insights. Bayesian statistics offers a flexible framework for model selection by updating prior beliefs as new data becomes available, allowing for ongoing refinement of candidate models. This is typically achieved by calculating posterior model probabilities, which quantify the support for each model given the observed data. However, when likelihood functions are intractable, exact computation of these posterior probabilities becomes infeasible. Approximate Bayesian computation (ABC) has emerged as a likelihood-free alternative. It is traditionally used with summary statistics to reduce data dimensionality, but this often results in information loss that is difficult to quantify, particularly in model selection contexts. Recent work proposes full data approaches based on statistical distances, offering a promising alternative that bypasses the need for handcrafted summary statistics and can, under suitable conditions, yield approximations that more closely reflect the true posterior. Despite these developments, full data ABC approaches have not yet been widely applied to model selection problems. This paper seeks to address this gap by investigating the performance of ABC with statistical distances in model selection. Through simulation studies and an application to toad movement models, this work explores whether full data approaches can overcome the limitations of summary statistic-based ABC for model choice.
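To make the idea of full data ABC for model choice concrete, the following is a minimal sketch of a rejection sampler that compares raw simulated and observed samples through a statistical distance instead of summary statistics. Everything in it is an illustrative assumption rather than the method studied in this paper: the two candidate models (`gaussian` and `exponential`), the uniform priors, the 1-D Wasserstein distance, and the function names `wasserstein_1d` and `abc_model_choice` are hypothetical choices made only for this example.

```python
import random


def wasserstein_1d(x, y):
    """1-Wasserstein distance between two equal-size 1-D samples.

    For equal sample sizes this is the mean absolute gap between
    the sorted values (order statistics) of the two samples.
    """
    assert len(x) == len(y)
    return sum(abs(a - b) for a, b in zip(sorted(x), sorted(y))) / len(x)


# Hypothetical toy simulators standing in for intractable-likelihood models.
def simulate_gaussian(mu, n, rng):
    return [rng.gauss(mu, 1.0) for _ in range(n)]


def simulate_exponential(mean, n, rng):
    return [rng.expovariate(1.0 / mean) for _ in range(n)]


SIMULATORS = {"gaussian": simulate_gaussian, "exponential": simulate_exponential}


def abc_model_choice(observed, n_sims=2000, keep_frac=0.02, seed=0):
    """Rejection ABC for model choice using a full data distance.

    Returns the share of each model among the accepted simulations,
    which approximates the posterior model probabilities.
    """
    rng = random.Random(seed)
    models = list(SIMULATORS)
    draws = []
    for _ in range(n_sims):
        model = rng.choice(models)        # uniform prior over models (assumption)
        theta = rng.uniform(0.1, 3.0)     # illustrative prior on the parameter
        simulated = SIMULATORS[model](theta, len(observed), rng)
        draws.append((wasserstein_1d(observed, simulated), model))
    # Keep the simulations closest to the observed data.
    draws.sort(key=lambda d: d[0])
    kept = draws[: max(1, int(keep_frac * n_sims))]
    return {m: sum(1 for _, km in kept if km == m) / len(kept) for m in models}


# Usage: data generated from the exponential toy model.
data_rng = random.Random(1)
observed = simulate_exponential(1.0, 200, data_rng)
posterior = abc_model_choice(observed)
print(posterior)
```

The acceptance rule here keeps a fixed fraction of the closest simulations rather than using a fixed tolerance, which is one common design choice among several. The accepted model frequencies are only an approximation whose quality depends on the distance, the number of simulations, and the acceptance fraction.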