We consider the problem of identifying the best arm in a multi-armed bandit model. Despite a wealth of literature on the traditional fixed budget and fixed confidence regimes of the best arm identification (BAI) problem, it remains unclear to most practitioners how to choose an approach and a corresponding budget or confidence parameter. We propose a new formalism that avoids this dilemma altogether by minimizing a risk functional which explicitly balances the performance of the recommended arm against the cost incurred in learning it. In this framework, a cost is incurred for each observation during the sampling phase, and upon recommending an arm, a performance penalty is incurred for identifying a suboptimal arm. The learner's goal is to minimize the sum of the penalty and the cost. This new regime mirrors the priorities of many practitioners, e.g., maximizing profit in an A/B testing framework, better than the classical fixed budget or fixed confidence settings. We derive theoretical lower bounds on the risk for each of two choices of performance penalty, the probability of misidentification and the simple regret, and propose an algorithm called DBCARE that matches these lower bounds up to polylogarithmic factors on nearly all problem instances. We then demonstrate the performance of DBCARE on a number of simulated models, comparing against fixed budget and fixed confidence algorithms to show the shortfalls of existing BAI paradigms on this problem.
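The risk functional described above can be sketched as follows; the notation here (sampling cost $c$, stopping time $\tau$, recommended arm $\hat{a}$) is assumed for illustration and is not taken from the abstract itself:

```latex
% A sketch of the risk objective: expected performance penalty plus
% per-sample cost c times the expected number of samples \tau.
% Two hypothetical choices of penalty \ell correspond to the two regimes
% mentioned in the abstract.
\[
  \mathcal{R} \;=\; \mathbb{E}\bigl[\ell(\hat{a})\bigr] \;+\; c\,\mathbb{E}[\tau],
  \qquad
  \ell(\hat{a}) =
  \begin{cases}
    \mathbb{1}\{\hat{a} \neq a^\*\} & \text{(probability of misidentification)} \\
    \mu_{a^\*} - \mu_{\hat{a}}      & \text{(simple regret)}
  \end{cases}
\]
```

Under this reading, the learner trades off sampling longer (larger $\mathbb{E}[\tau]$, hence higher cost) against a smaller expected penalty at recommendation time.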