广义线性赌博机：单次更新实现近乎最优的遗憾界 (Generalized Linear Bandits: Almost Optimal Regret with One-Pass Update)

We study the generalized linear bandit (GLB) problem, a contextual multi-armed bandit framework that extends the classical linear model by incorporating a non-linear link function, thereby modeling a broad class of reward distributions such as Bernoulli and Poisson. While GLBs are widely applicable to real-world scenarios, their non-linear nature introduces significant challenges in achieving both computational and statistical efficiency. Existing methods typically trade off between two objectives, either incurring high per-round costs for optimal regret guarantees or compromising statistical efficiency to enable constant-time updates. In this paper, we propose a jointly efficient algorithm that attains a nearly optimal regret bound with $\mathcal{O}(1)$ time and space complexities per round. The core of our method is a tight confidence set for the online mirror descent (OMD) estimator, which is derived through a novel analysis that leverages the notion of mix loss from online prediction. The analysis shows that our OMD estimator, even with its one-pass updates, achieves statistical efficiency comparable to maximum likelihood estimation, thereby leading to a jointly efficient optimistic method.

翻译：本文研究广义线性赌博机问题，这是一种上下文多臂赌博机框架，通过引入非线性链接函数扩展经典线性模型，从而能够建模伯努利分布、泊松分布等广泛类别的奖励分布。尽管广义线性赌博机在现实场景中具有广泛应用，但其非线性特性为实现计算效率与统计效率的兼顾带来了显著挑战。现有方法通常需在两者间进行权衡：要么为获得最优遗憾保证而承受每轮高计算成本，要么为实现常数时间更新而牺牲统计效率。本文提出一种联合高效算法，在每轮仅需$\mathcal{O}(1)$时间与空间复杂度的条件下达到近乎最优的遗憾界。该算法的核心在于为在线镜像下降估计量构建紧致的置信集，其推导基于在线预测中的混合损失概念进行创新性分析。分析表明，即使采用单次更新机制，我们的在线镜像下降估计量仍能达到与极大似然估计相当的统计效率，从而形成一种兼具计算与统计效率的乐观算法。