Handling missing values in risk prediction modeling: a comparative simulation study on parametric and machine learning multiple imputations

Wu, Yuxin

dc.contributor.advisor	Su, Yu-Ru
dc.contributor.author	Wu, Yuxin
dc.date.accessioned	2023-09-27T17:18:17Z
dc.date.available	2023-09-27T17:18:17Z
dc.date.issued	2023-09-27
dc.date.submitted	2023
dc.description	Thesis (Master's)--University of Washington, 2023
dc.description.abstract	Risk prediction is a critical tool in preventive medicine, enabling precision prevention for diseases. Electronic health record (EHR) data offers a rich source for constructing risk models, capturing detailed clinical information from patient cohorts. However, missing data poses a prevalent challenge in EHR analysis, and multiple imputation (MI) is a popular strategy for handling missing data. In this thesis, we employed simulations to compare different MI methods (parametric MI, MI using Random Forest, MI using Gradient Boosting Machines and MI using Principal Component Analysis) within the context of risk prediction modeling. Our investigation focused on evaluating predictive performance, encompassing measures of predictive accuracy and precision, for risk prediction models developed and assessed in datasets processed with various MI strategies. Furthermore, we explored two facets: (1) the impacts of including or omitting the outcome variable during MI, and (2) the impacts of model misspecification of higher-order effects during MI. We also used breast surveillance mammogram examination data from breast cancer survivors in the Breast Cancer Surveillance Consortium (BCSC) as the input for part of the bootstrapping and data illustration complementary to the simulation study. Our results revealed that the adoption of machine learning-based imputation methods did not lead to superior model performance compared to traditional parametric imputation. We recommend against including the outcome variable in the imputation model for the test set since it may raise concerns of over-optimistic predictive performance. Although it is not the focus of this thesis, we also recommend being cautious about using Random Forest as the risk prediction model in similar prediction modeling settings.
dc.embargo.terms	Open Access
dc.format.mimetype	application/pdf
dc.identifier.other	Wu_washington_0250O_26130.pdf
dc.identifier.uri	http://hdl.handle.net/1773/50711
dc.language.iso	en_US
dc.rights	CC BY
dc.subject	Clinical prediction models
dc.subject	Machine learning
dc.subject	Multiple imputation
dc.subject	Biostatistics
dc.subject.other	Biostatistics
dc.title	Handling missing values in risk prediction modeling: a comparative simulation study on parametric and machine learning multiple imputations
dc.type	Thesis

Files

Original bundle

Name: Wu_washington_0250O_26130.pdf
Size: 6.64 MB
Format: Adobe Portable Document Format

Collections

Biostatistics