Comparing Internal Validation Methods for a Random Forest Prediction Model of Suicide Death

Liao, Qinqing

dc.contributor.advisor	Coley, Yates
dc.contributor.author	Liao, Qinqing
dc.date.accessioned	2020-10-26T20:39:50Z
dc.date.available	2020-10-26T20:39:50Z
dc.date.issued	2020-10-26
dc.date.submitted	2020
dc.description	Thesis (Master's)--University of Washington, 2020
dc.description.abstract	Predictive models estimated with clinical data are increasingly popular in medicine. After developing a prediction model, it is necessary to validate it, that is, to evaluate how it will perform in practice. Model validation methods include both internal and external validation; this thesis focuses on comparing internal validation methods using a split sample approach and an entire sample approach. The split sample approach uses a typical randomly selected validation set. For the entire sample approach, we explored three methods: approximate optimism correction, exact optimism correction, and 5-fold cross-validation (CV). The dataset included 13,980,570 records of mental health outpatient visits between 2011 and 2017, with information on diagnoses, medications, and encounters prior to the visit and follow-up information on suicide death. Data were separated into a development dataset, which included visits from 2011 to 2014 and was used for model estimation and internal validation, and a prospective validation set, which included visits from 2015 to 2017 and was used to mimic the future data the model would encounter if it were implemented in clinical practice. We estimated a random forest model to predict suicide death in the 90 days following a visit. We found that the split sample approach and 5-fold CV using the entire sample provided more accurate estimates of model performance than the exact and approximate optimism correction methods using the entire sample, both of which underestimated model optimism and thus overestimated model performance in the prospective dataset. Our results stand in contrast to prior research, which demonstrated the accuracy of optimism correction methods with logistic regression models estimated using an entire sample approach.
While findings may differ for other datasets, model estimation methods, and prediction applications, we recommend caution when using optimism correction methods for internal validation of prediction models estimated in the entire sample when working with very large datasets, rare events, and machine learning prediction models.
dc.embargo.terms	Open Access
dc.format.mimetype	application/pdf
dc.identifier.other	Liao_washington_0250O_22108.pdf
dc.identifier.uri	http://hdl.handle.net/1773/46384
dc.language.iso	en_US
dc.rights	none
dc.subject	Electronic Health Records
dc.subject	Internal Validation
dc.subject	Optimism Correction
dc.subject	Predictive Model
dc.subject	Random Forest
dc.subject	Biostatistics
dc.subject.other	Biostatistics
dc.title	Comparing Internal Validation Methods for a Random Forest Prediction Model of Suicide Death
dc.type	Thesis
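
Illustrative note (not part of the thesis record): the abstract names optimism correction without spelling out the procedure. As a rough sketch only, the Python below shows the standard bootstrap form of optimism correction: fit on the entire sample, measure apparent performance, then subtract the average amount by which models refit on bootstrap resamples look better on their own resample than on the original data. The abstract does not say how the thesis defines its "exact" and "approximate" variants, so this should not be read as either of them. The names `auc`, `fit_centroid`, and `optimism_corrected_auc`, the simulated data, and the simple centroid scorer standing in for a random forest are all hypothetical choices made to keep the example dependency-free.

```python
import random
from statistics import mean


def auc(scores, labels):
    """Rank-based AUC: chance that a random positive outscores a random negative."""
    pos = [s for s, t in zip(scores, labels) if t == 1]
    neg = [s for s, t in zip(scores, labels) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def fit_centroid(X, y):
    """Hypothetical stand-in model (NOT a random forest).

    Scores a record by projecting it onto the difference between the
    positive-class and negative-class feature means.
    """
    d = len(X[0])
    pos = [x for x, t in zip(X, y) if t == 1]
    neg = [x for x, t in zip(X, y) if t == 0]
    w = [mean(p[j] for p in pos) - mean(q[j] for q in neg) for j in range(d)]
    return lambda x: sum(wj * xj for wj, xj in zip(w, x))


def optimism_corrected_auc(X, y, fit, n_boot=50, seed=0):
    """Return (apparent AUC, bootstrap optimism-corrected AUC)."""
    rng = random.Random(seed)
    n = len(y)

    # Apparent performance: model fit and evaluated on the entire sample.
    model = fit(X, y)
    apparent = auc([model(x) for x in X], y)

    optimism = []
    while len(optimism) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]  # resample with replacement
        Xb = [X[i] for i in idx]
        yb = [y[i] for i in idx]
        if len(set(yb)) < 2:
            continue  # with rare events a resample may contain no events; skip it
        boot_model = fit(Xb, yb)
        auc_on_boot = auc([boot_model(x) for x in Xb], yb)
        auc_on_orig = auc([boot_model(x) for x in X], y)
        optimism.append(auc_on_boot - auc_on_orig)

    return apparent, apparent - mean(optimism)


# Simulated data (assumption): 120 records, ~20% events,
# one informative feature and four pure-noise features.
data_rng = random.Random(1)
y = [1 if data_rng.random() < 0.2 else 0 for _ in range(120)]
X = [[data_rng.gauss(0.8 * t, 1.0)] + [data_rng.gauss(0, 1) for _ in range(4)]
     for t in y]

apparent, corrected = optimism_corrected_auc(X, y, fit_centroid, n_boot=20)
print(f"apparent AUC:  {apparent:.3f}")
print(f"corrected AUC: {corrected:.3f}")
```

The `fit` argument is any function that takes training data and returns a scoring function, so a random forest wrapper could be swapped in without changing the correction logic. The abstract's caution applies here: with very large datasets, rare events, and flexible learners, the estimated optimism may be too small.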

Files

Original bundle

Name: Liao_washington_0250O_22108.pdf
Size: 660.32 KB
Format: Adobe Portable Document Format

Collections

Biostatistics