Comparing Internal Validation Methods for a Random Forest Prediction Model of Suicide Death

dc.contributor.advisorColey, Yates
dc.contributor.authorLiao, Qinqing
dc.date.accessioned2020-10-26T20:39:50Z
dc.date.available2020-10-26T20:39:50Z
dc.date.issued2020-10-26
dc.date.submitted2020
dc.descriptionThesis (Master's)--University of Washington, 2020
dc.description.abstractPredictive models estimated with clinical data are increasingly popular in the medical field. After developing a prediction model, it is necessary to validate it, that is, to evaluate how it will perform in practice. Model validation methods include both internal and external validation; this thesis focuses on comparing internal validation methods that use either a split sample or the entire sample. The split sample approach holds out a randomly selected validation set. For the entire sample approach, we explored three methods: approximate optimism correction, exact optimism correction, and 5-fold cross-validation (CV). The dataset included 13,980,570 records of mental health outpatient visits between 2011 and 2017, with information on diagnoses, medications, and encounters prior to the visit and follow-up information on suicide death. Data were separated into a development dataset, which included visits from 2011 to 2014 and was used for model estimation and internal validation, and a prospective validation set, which included visits from 2015 to 2017 and was used to mimic the data that would be encountered if the model were implemented in clinical practice. We estimated a random forest model to predict suicide death in the 90 days following a visit. We found that the split sample method and 5-fold CV using the entire sample provided more accurate estimates of model performance than the approximate and exact optimism correction methods using the entire sample, both of which underestimated model optimism and thus overestimated model performance in the prospective dataset. Our results stand in contrast to prior research, which demonstrated the accuracy of optimism correction methods for logistic regression models estimated using the entire sample.
While findings may differ for other datasets, model estimation methods, and prediction applications, we recommend caution when using optimism correction for internal validation of prediction models estimated in the entire sample, particularly when working with very large datasets, rare events, and machine learning prediction models.
dc.embargo.termsOpen Access
dc.format.mimetypeapplication/pdf
dc.identifier.otherLiao_washington_0250O_22108.pdf
dc.identifier.urihttp://hdl.handle.net/1773/46384
dc.language.isoen_US
dc.rightsnone
dc.subjectElectronic Health Records
dc.subjectInternal Validation
dc.subjectOptimism Correction
dc.subjectPredictive Model
dc.subjectRandom Forest
dc.subjectBiostatistics
dc.subject.otherBiostatistics
dc.titleComparing Internal Validation Methods for a Random Forest Prediction Model of Suicide Death
dc.typeThesis
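
For readers unfamiliar with the optimism correction named in the abstract, the following is a minimal sketch of the standard bootstrap form of the idea: estimate how much a model's apparent performance is inflated by refitting it on bootstrap resamples, then subtract that average inflation. It is not the thesis's implementation. The stand-in model (a class-mean-difference linear score instead of a random forest), the synthetic data, and all function names are assumptions made for illustration, and the record does not say how the thesis's "approximate" and "exact" variants differ from this form.

```python
import random
from statistics import mean


def auc(scores, labels):
    """Probability that a random positive outscores a random negative (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def fit(X, y):
    """Hypothetical stand-in model: weights = positive-class mean minus negative-class mean."""
    pos = [x for x, t in zip(X, y) if t == 1]
    neg = [x for x, t in zip(X, y) if t == 0]
    return [mean(r[j] for r in pos) - mean(r[j] for r in neg) for j in range(len(X[0]))]


def predict(w, X):
    """Linear score for each row of X."""
    return [sum(wj * xj for wj, xj in zip(w, x)) for x in X]


def optimism_corrected_auc(X, y, n_boot=50, seed=0):
    """Return (apparent AUC, bootstrap optimism-corrected AUC)."""
    rng = random.Random(seed)
    n = len(y)
    # Apparent performance: fit and evaluate on the same (entire) sample.
    apparent = auc(predict(fit(X, y), X), y)
    optimism = []
    while len(optimism) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]  # resample with replacement
        Xb, yb = [X[i] for i in idx], [y[i] for i in idx]
        if len(set(yb)) < 2:  # skip resamples that lack one of the classes
            continue
        w = fit(Xb, yb)
        # Optimism: performance on the resample minus performance on the original data.
        optimism.append(auc(predict(w, Xb), yb) - auc(predict(w, X), y))
    return apparent, apparent - mean(optimism)


# Synthetic data (an assumption): one informative feature and nine noise features.
data_rng = random.Random(1)
y = [1 if data_rng.random() < 0.3 else 0 for _ in range(120)]
X = [[data_rng.gauss(0.8 * t, 1.0)] + [data_rng.gauss(0, 1) for _ in range(9)] for t in y]

apparent, corrected = optimism_corrected_auc(X, y)
print(f"apparent AUC = {apparent:.3f}, optimism-corrected AUC = {corrected:.3f}")
```

The abstract's caution applies to this kind of estimate: if the bootstrap understates the optimism, as the thesis reports for its random forest, the corrected figure will still overstate prospective performance.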

Files

Original bundle

Name: Liao_washington_0250O_22108.pdf
Size: 660.32 KB
Format: Adobe Portable Document Format
