Sampling designs for resource efficient collection of outcome labels for machine-learning, with application to electronic medical records

Tan, Wei Ling Katherine

dc.contributor.advisor	Heagerty, Patrick J
dc.contributor.author	Tan, Wei Ling Katherine
dc.date.accessioned	2019-02-22T17:03:11Z
dc.date.issued	2019-02-22
dc.date.submitted	2018
dc.description	Thesis (Ph.D.)--University of Washington, 2018
dc.description.abstract	In leveraging data from large-scale electronic medical record systems for research, an important step is the accurate identification of key clinical outcomes. Some outcomes must be derived or predicted from both structured and unstructured data, for example using statistical machine-learning classification. Classification requires the collection of labeled data, which is a sample where actual outcome statuses are manually coded by human clinical experts. For rare outcomes, simple random sampling (SRS) for labeled data collection results in very few cases in the sample. Such outcome class imbalance results in insufficient information for classifier modeling, yet additional abstraction is often expensive and time-consuming. In this dissertation, we propose sampling designs for labeled data collection towards machine-learning, targeting the rare outcome scenario. Our proposed designs are resource efficient, requiring a smaller sample size for modeling goals compared to SRS, yet design impacts on model development and validation can be statistically characterized to be "valid". We first introduce a stratified sampling procedure based on values of enrichment surrogates, which are summaries of structured data related to the clinical outcome requiring abstraction. Next, motivated by radiology reports with multiple co-occurring findings, we discuss extensions to the multi-label setting. Finally, for scenarios where a previously developed "source" model is to be externally transferred, we propose a framework for such "new" labeled data collection.
dc.embargo.lift	2021-02-11T17:03:11Z
dc.embargo.terms	Restrict to UW for 2 years -- then make Open Access
dc.format.mimetype	application/pdf
dc.identifier.other	Tan_washington_0250E_19428.pdf
dc.identifier.uri	http://hdl.handle.net/1773/43317
dc.language.iso	en_US
dc.rights	none
dc.subject	clinical research
dc.subject	electronic health records
dc.subject	epidemiology
dc.subject	machine-learning
dc.subject	sampling design
dc.subject	Biostatistics
dc.subject.other	Biostatistics
dc.title	Sampling designs for resource efficient collection of outcome labels for machine-learning, with application to electronic medical records
dc.type	Thesis

Files

Original bundle

Name: Tan_washington_0250E_19428.pdf
Size: 8.06 MB
Format: Adobe Portable Document Format

Collections

Biostatistics