Handling missing values in risk prediction modeling: a comparative simulation study on parametric and machine learning multiple imputations

dc.contributor.advisor: Su, Yu-Ru
dc.contributor.author: Wu, Yuxin
dc.date.accessioned: 2023-09-27T17:18:17Z
dc.date.available: 2023-09-27T17:18:17Z
dc.date.issued: 2023-09-27
dc.date.submitted: 2023
dc.description: Thesis (Master's)--University of Washington, 2023
dc.description.abstract: Risk prediction is a critical tool in preventive medicine, enabling precision prevention for diseases. Electronic health record (EHR) data offers a rich source for constructing risk models, capturing detailed clinical information from patient cohorts. However, missing data poses a prevalent challenge in EHR analysis, and multiple imputation (MI) is a popular strategy for handling missing data. In this thesis, we employed simulations to compare different MI methods (parametric MI, MI using Random Forest, MI using Gradient Boosting Machines and MI using Principal Component Analysis) within the context of risk prediction modeling. Our investigation focused on evaluating predictive performance, encompassing measures of predictive accuracy and precision, for risk prediction models developed and assessed in datasets processed with various MI strategies. Furthermore, we explored two facets: (1) the impacts of including or omitting the outcome variable during MI, and (2) the impacts of model misspecification of higher-order effects during MI. We also used breast surveillance mammogram examination data from breast cancer survivors in the Breast Cancer Surveillance Consortium (BCSC) as the input for part of the bootstrapping and data illustration complementary to the simulation study. Our results revealed that the adoption of machine learning-based imputation methods did not lead to superior model performance compared to traditional parametric imputation. We recommend against including the outcome variable in the imputation model for the test set since it may raise concerns of over-optimistic predictive performance. Although it is not the focus of this thesis, we also recommend being cautious about using Random Forest as the risk prediction model for similar prediction modeling settings.
dc.embargo.terms: Open Access
dc.format.mimetype: application/pdf
dc.identifier.other: Wu_washington_0250O_26130.pdf
dc.identifier.uri: http://hdl.handle.net/1773/50711
dc.language.iso: en_US
dc.rights: CC BY
dc.subject: Clinical prediction models
dc.subject: Machine learning
dc.subject: Multiple imputation
dc.subject: Biostatistics
dc.subject.other: Biostatistics
dc.title: Handling missing values in risk prediction modeling: a comparative simulation study on parametric and machine learning multiple imputations
dc.type: Thesis

Files

Original bundle

Name: Wu_washington_0250O_26130.pdf
Size: 6.64 MB
Format: Adobe Portable Document Format
