An important aspect of the performance of algorithms that predict individualized treatment effects (ITE) is moderate calibration, i.e., the average treatment effect among individuals with predicted treatment effect of z being equal to z. The assessment of moderate calibration is a challenging task on two fronts: counterfactual responses are unobserved, and quantifying the conditional response function for models that generate continuous predicted values requires regularization or parametric modeling. Perhaps because of these challenges, there is currently no inferential method for the null hypothesis that an ITE model is moderately calibrated in a population. In this work, we propose non-parametric methods for the assessment of moderate calibration of ITE models for binary outcomes using data from a randomized trial. These methods simultaneously resolve both challenges, resulting in novel numerical, graphical, and inferential methods for the assessment of moderate calibration. The key idea is to formulate a stochastic process for the cumulative prediction errors that obeys a functional central limit theorem, enabling the use of the properties of Brownian motion for asymptotic inference. We propose two approaches to construct this process from a sample: a conditional approach that relies on predicted risks (often an output of ITE models), and a marginal approach based on replacing the cumulative conditional expected value and variance terms with their marginal counterparts. Numerical simulations confirm the desirable properties of both approaches and their ability to detect miscalibration of different forms. We use a case study to provide practical suggestions on graphical presentation and the interpretation of results. Moderate calibration of predicted ITEs can be assessed without requiring regularization techniques or making assumptions about the functional form of treatment response.
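To make the key idea concrete, the following is a minimal sketch of a cumulative-prediction-error process and a Brownian-motion-based test. It is an illustration under stated assumptions and not necessarily the exact construction proposed in this work. The assumptions are: the predicted ITE is defined as predicted risk under treatment minus predicted risk under control; the randomization probability `pi` is known; the unobserved ITE is replaced by an inverse-probability-weighted transformed outcome; and the null variance is computed from the predicted risks, which is a stronger requirement than moderate calibration of the ITE alone. All function names are hypothetical.

```python
import math
import random


def sup_abs_bm_cdf(x, terms=50):
    """P(sup over [0,1] of |W(t)| <= x) for standard Brownian motion W."""
    if x <= 0:
        return 0.0
    total = 0.0
    for k in range(terms):
        m = 2 * k + 1
        total += (-1) ** k / m * math.exp(-((m * math.pi) ** 2) / (8 * x * x))
    return min(1.0, max(0.0, 4 / math.pi * total))


def cumulative_error_process(y, a, p0, p1, pi=0.5):
    """Scaled cumulative ITE prediction error, ordered by predicted ITE.

    y: binary outcomes, a: treatment indicators (1 = treated),
    p0/p1: predicted risks under control/treatment (assumed available).
    Returns (times, values); under the assumed null the path should
    behave approximately like Brownian motion on [0, 1].
    """
    rows = []
    for yi, ai, q0, q1 in zip(y, a, p0, p1):
        z = q1 - q0  # predicted ITE (assumed sign convention)
        # Transformed outcome: its conditional mean equals the true ITE.
        y_star = yi * (ai / pi - (1 - ai) / (1 - pi))
        # Variance of y_star if the predicted risks were the true risks.
        var = q1 / pi + q0 / (1 - pi) - z * z
        rows.append((z, y_star - z, var))
    rows.sort(key=lambda r: r[0])
    total_var = sum(r[2] for r in rows)
    times, values = [], []
    s = v = 0.0
    for _, err, var in rows:
        s += err
        v += var
        times.append(v / total_var)  # "time" = share of cumulative variance
        values.append(s / math.sqrt(total_var))
    return times, values


def calibration_test(y, a, p0, p1, pi=0.5):
    """Return (max absolute deviation of the path, asymptotic p-value)."""
    _, values = cumulative_error_process(y, a, p0, p1, pi)
    stat = max(abs(w) for w in values)
    return stat, 1.0 - sup_abs_bm_cdf(stat)


def simulate_trial(n, seed=0):
    """Simulate a 1:1 trial in which the predicted risks are the true risks."""
    rng = random.Random(seed)
    y, a, p0, p1 = [], [], [], []
    for _ in range(n):
        q0 = rng.uniform(0.2, 0.6)
        q1 = q0 - rng.uniform(0.0, 0.15)
        ai = 1 if rng.random() < 0.5 else 0
        yi = 1 if rng.random() < (q1 if ai else q0) else 0
        y.append(yi); a.append(ai); p0.append(q0); p1.append(q1)
    return y, a, p0, p1


if __name__ == "__main__":
    data = simulate_trial(2000, seed=1)
    stat, p_value = calibration_test(*data)
    print(f"statistic={stat:.3f}, p-value={p_value:.3f}")
```

Plotting `values` against `times` gives a graphical check: under the assumed null the path should wander like Brownian motion without systematic drift, whereas sustained drift over a range of predicted effects suggests miscalibration in that range.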