Evaluation of Prediction Performance Metrics in the Rare Event Setting

Minus, Emily

Files

Minus_washington_0250O_25611.pdf (26.54 MB)

Date

2023-08-14

Abstract

The area under the receiver operating characteristic curve (AUC) is a commonly reported measure of discriminative performance for binary prediction models. However, there are concerns that AUC can be a misleading measure of prediction performance in the rare event setting. This setting is common for clinical prediction models, since many events of clinical importance, such as suicide, occur only rarely. We conducted a simulation study to investigate what drives inaccurate or unstable AUC performance in the rare event setting. Specifically, we aimed to determine whether the main driver of poor AUC performance is a small number of events or a low event rate (i.e., there are many events, but they represent a small fraction of the total observations). We also investigated the behavior of other commonly used measures of prediction performance, such as positive predictive value (PPV), accuracy, sensitivity, and specificity. Our results indicate that poor AUC performance, as measured by empirical bias, empirical mean squared error (MSE), variability of cross-validated AUC estimates, and empirical coverage of bootstrap intervals, is driven by the number of events, not the event rate. Although the performance measure of greatest interest depends on how a model will be used, AUC is reliable in the rare event setting provided that the total number of events is moderately large.
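
The distinction between number of events and event rate can be illustrated with a minimal sketch. This is not the simulation design used in the thesis, which the abstract does not describe. It assumes a simple binormal setup (risk scores drawn from N(delta, 1) for events and N(0, 1) for non-events), fixes the event count in each replicate, and uses hypothetical helper names (`empirical_auc`, `simulate_auc`) and arbitrary sample sizes chosen only for illustration.

```python
import math
import numpy as np


def empirical_auc(scores, labels):
    """Mann-Whitney estimate of AUC: P(event score > non-event score), ties count 1/2."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels).astype(bool)
    pos = scores[labels]
    neg = np.sort(scores[~labels])
    if pos.size == 0 or neg.size == 0:
        raise ValueError("AUC needs at least one event and one non-event")
    # For each event score, count the non-event scores strictly below it and tied with it.
    below = np.searchsorted(neg, pos, side="left")
    tied = np.searchsorted(neg, pos, side="right") - below
    return float((below + 0.5 * tied).sum() / (pos.size * neg.size))


def simulate_auc(n_events, n_total, delta=1.0, reps=200, seed=0):
    """Empirical bias, SD, and MSE of the AUC estimate over repeated samples.

    Assumption (illustrative only): binormal scores with unit variances, so the
    true AUC is Phi(delta / sqrt(2)) = 0.5 * (1 + erf(delta / 2)).
    """
    rng = np.random.default_rng(seed)
    n_nonevents = n_total - n_events
    labels = np.r_[np.ones(n_events, dtype=bool), np.zeros(n_nonevents, dtype=bool)]
    true_auc = 0.5 * (1.0 + math.erf(delta / 2.0))
    estimates = np.empty(reps)
    for r in range(reps):
        scores = np.r_[rng.normal(delta, 1.0, n_events),
                       rng.normal(0.0, 1.0, n_nonevents)]
        estimates[r] = empirical_auc(scores, labels)
    return {
        "bias": float(estimates.mean() - true_auc),
        "sd": float(estimates.std(ddof=1)),
        "mse": float(np.mean((estimates - true_auc) ** 2)),
    }


# Three hypothetical scenarios: (number of events, total sample size).
scenarios = {
    "20 events, 1% rate": (20, 2_000),
    "200 events, 1% rate": (200, 20_000),
    "200 events, 10% rate": (200, 2_000),
}
for name, (n_events, n_total) in scenarios.items():
    print(name, simulate_auc(n_events, n_total))
```

Under these assumptions, comparing the first two scenarios holds the event rate fixed while varying the event count, and comparing the last two holds the event count fixed while varying the rate, which mirrors the question posed in the abstract.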