Behavior of prediction performance metrics with rare events

From arXiv. Accepted for publication in the Journal of Clinical Epidemiology. 51 pages (16 main, 35 supplementary), 26 tables (3 main, 23 supplementary), 6 figures (4 main, 2 supplementary).

Objective: The area under the receiver operating characteristic curve (AUC) is commonly reported alongside prediction models for binary outcomes. Recent articles have raised concerns that AUC might be a misleading measure of prediction performance in the rare event setting. This setting is common, since many events of clinical importance are rare. We aimed to determine whether the bias and variance of AUC are driven by the number of events or by the event rate. We also investigated the behavior of other commonly used measures of prediction performance, including positive predictive value, accuracy, sensitivity, and specificity.

Study Design and Setting: We conducted a simulation study to determine whether, and when, AUC is unstable in the rare event setting by varying the size of the datasets used to train and evaluate prediction models. This plasmode simulation study was based on data from the Mental Health Research Network; the data contained 149 predictors and the outcome of interest, suicide attempt, which had an event rate of 0.92% in the original dataset.

Results: Our results indicate that poor AUC behavior (as measured by empirical bias, variability of cross-validated AUC estimates, and empirical coverage of confidence intervals) is driven by the number of events in a rare-event setting, not by the event rate. Performance of sensitivity is driven by the number of events, while that of specificity is driven by the number of non-events. Other measures, including positive predictive value and accuracy, depend on the event rate even in large samples.

Conclusion: AUC is reliable in the rare event setting provided that the total number of events is moderately large; in our simulations, we observed near-zero bias with 1000 events.
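The claim that AUC variability tracks the number of events rather than the event rate can be illustrated with a small toy simulation. The sketch below is not the authors' plasmode study: it trains no model and uses no Mental Health Research Network data. It assumes, purely for illustration, that risk scores are drawn from N(1, 1) for events and N(0, 1) for non-events, and it only looks at the sampling variability of AUC on an evaluation set. The function names (`auc_rank`, `auc_mean_sd`, `ppv`) and the sensitivity and specificity values of 0.8 and 0.9 are hypothetical. The `ppv` helper applies Bayes' rule to show why positive predictive value changes with the event rate even when sensitivity and specificity are held fixed.

```python
import numpy as np


def auc_rank(scores_pos, scores_neg):
    """AUC as the fraction of (event, non-event) pairs where the event scores higher.

    Ties are ignored, which is fine for the continuous scores simulated here.
    """
    neg_sorted = np.sort(scores_neg)
    # For each event score, count the non-event scores strictly below it.
    below = np.searchsorted(neg_sorted, scores_pos, side="left")
    return below.sum() / (len(scores_pos) * len(scores_neg))


def auc_mean_sd(n_events, event_rate, n_reps=200, seed=0):
    """Mean and SD of the empirical AUC over repeated simulated evaluation sets.

    Assumption (illustrative only): scores are N(1, 1) for events and
    N(0, 1) for non-events, so the true AUC is about 0.76.
    """
    rng = np.random.default_rng(seed)
    n_non_events = int(round(n_events / event_rate)) - n_events
    aucs = np.empty(n_reps)
    for r in range(n_reps):
        pos = rng.normal(1.0, 1.0, n_events)
        neg = rng.normal(0.0, 1.0, n_non_events)
        aucs[r] = auc_rank(pos, neg)
    return aucs.mean(), aucs.std(ddof=1)


def ppv(sensitivity, specificity, event_rate):
    """Positive predictive value from Bayes' rule."""
    true_pos = sensitivity * event_rate
    false_pos = (1.0 - specificity) * (1.0 - event_rate)
    return true_pos / (true_pos + false_pos)


if __name__ == "__main__":
    # Same number of events at two event rates, then more events at the rare rate.
    for n_events, rate in [(50, 0.01), (50, 0.10), (500, 0.01)]:
        mean, sd = auc_mean_sd(n_events, rate)
        print(f"events={n_events:4d} rate={rate:.2f} mean AUC={mean:.3f} SD={sd:.4f}")

    # Hypothetical classifier with sensitivity 0.8 and specificity 0.9.
    for rate in [0.0092, 0.10, 0.50]:
        print(f"event rate={rate:.4f} PPV={ppv(0.8, 0.9, rate):.3f}")
```

Under these assumptions, the two scenarios with 50 events should show similar spread in AUC despite a tenfold difference in event rate, while the scenario with 500 events should show a clearly smaller spread. This matches the direction of the reported results but says nothing about their magnitude in real data.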
