Online A/B testing, the gold standard for evaluating new advertising policies, consumes substantial engineering resources and risks significant revenue loss when underperforming variants are deployed. This motivates the use of Off-Policy Evaluation (OPE) for rapid, offline assessment. However, applying OPE to ad auctions is fundamentally more challenging than in domains such as recommender systems, where stochastic policies are common. In online ad auctions, the highest-bidding ad typically wins the impression, which yields a deterministic, winner-takes-all setting. Non-winning ads therefore have zero probability of exposure, rendering standard OPE estimators inapplicable. We introduce the first principled framework for OPE in deterministic auctions by repurposing the bid landscape model to approximate the propensity score. This model allows us to derive robust approximate propensity scores, enabling the use of stable estimators such as Self-Normalized Inverse Propensity Scoring (SNIPS) for counterfactual evaluation. We validate our approach on the AuctionNet simulation benchmark and against a two-week online A/B test from a large-scale industrial platform. Our method aligns closely with the online results, achieving 92% Mean Directional Accuracy (MDA) in CTR prediction and significantly outperforming the parametric baseline. MDA is the most critical metric for guiding deployment decisions because it reflects the ability to correctly predict whether a new model will improve or harm performance. This work contributes the first practical and validated framework for reliable OPE in deterministic auction environments, offering an efficient alternative to costly and risky online experiments.
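As a rough illustration of the pipeline described above, the following Python sketch shows one way a bid landscape could stand in for the propensity score inside a SNIPS estimate, and how MDA could be computed against online results. It is a minimal sketch under stated assumptions, not the authors' implementation: the log-normal form of the bid landscape, the parameters `mu` and `sigma`, the clipping constant `eps`, and all function names are hypothetical choices made only for this example.

```python
import math


def win_prob(bid, mu=0.0, sigma=1.0):
    """Approximate propensity: probability that `bid` beats the highest
    competing bid. ASSUMPTION: the bid landscape is log-normal(mu, sigma)."""
    if bid <= 0:
        return 0.0
    z = (math.log(bid) - mu) / (sigma * math.sqrt(2.0))
    return 0.5 * (1.0 + math.erf(z))


def snips(rewards, logged_bids, target_bids, mu=0.0, sigma=1.0, eps=1e-6):
    """Self-normalized IPS estimate of the target policy's mean reward.

    Each logged impression is reweighted by the ratio of approximate win
    probabilities under the target bid and the logged bid (hypothetical
    weighting scheme for this sketch)."""
    num = 0.0
    den = 0.0
    for r, b_log, b_new in zip(rewards, logged_bids, target_bids):
        p_log = max(win_prob(b_log, mu, sigma), eps)  # clip to avoid division by ~0
        w = win_prob(b_new, mu, sigma) / p_log
        num += w * r
        den += w
    return num / den if den > 0 else 0.0


def mean_directional_accuracy(offline_deltas, online_deltas):
    """Fraction of experiments where the offline estimate and the online
    A/B test agree on the sign of the metric change."""
    hits = sum(
        1 for a, b in zip(offline_deltas, online_deltas) if (a > 0) == (b > 0)
    )
    return hits / len(offline_deltas)


if __name__ == "__main__":
    clicks = [1, 0, 1, 0]
    logged = [1.0, 2.0, 3.0, 4.0]
    target = [1.5, 2.0, 2.5, 5.0]  # hypothetical bids from a new policy
    print("SNIPS CTR estimate:", snips(clicks, logged, target))
    print("MDA:", mean_directional_accuracy([0.1, -0.2, 0.3], [0.2, -0.1, -0.3]))
```

The self-normalization (dividing by the sum of weights rather than the sample count) is what keeps the estimate bounded by the observed reward range even when individual weights are large, which is the usual motivation for preferring SNIPS over plain IPS.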