Sampling designs for resource efficient collection of outcome labels for machine-learning, with application to electronic medical records

dc.contributor.advisor: Heagerty, Patrick J
dc.contributor.author: Tan, Wei Ling Katherine
dc.date.accessioned: 2019-02-22T17:03:11Z
dc.date.issued: 2019-02-22
dc.date.submitted: 2018
dc.description: Thesis (Ph.D.)--University of Washington, 2018
dc.description.abstract: In leveraging data from large-scale electronic medical record systems for research, an important step is the accurate identification of key clinical outcomes. Some outcomes must be derived or predicted from both structured and unstructured data, for example using statistical machine-learning classification. Classification requires the collection of labeled data, which is a sample where actual outcome statuses are manually coded by human clinical experts. For rare outcomes, simple random sampling (SRS) for labeled data collection results in very few cases in the sample. Such outcome class imbalance results in insufficient information for classifier modeling, yet additional abstraction is often expensive and time-consuming. In this dissertation, we propose sampling designs for labeled data collection towards machine-learning, targeting the rare outcome scenario. Our proposed designs are resource efficient, requiring a smaller sample size for modeling goals compared to SRS, yet design impacts on model development and validation can be statistically characterized to be "valid". We first introduce a stratified sampling procedure based on values of enrichment surrogates, which are summaries of structured data related to the clinical outcome requiring abstraction. Next, motivated by radiology reports with multiple co-occurring findings, we discuss extensions to the multi-label setting. Finally, for scenarios where a previously developed "source" model is to be externally transferred, we propose a framework for such "new" labeled data collection.
dc.embargo.lift: 2021-02-11T17:03:11Z
dc.embargo.terms: Restrict to UW for 2 years -- then make Open Access
dc.format.mimetype: application/pdf
dc.identifier.other: Tan_washington_0250E_19428.pdf
dc.identifier.uri: http://hdl.handle.net/1773/43317
dc.language.iso: en_US
dc.rights: none
dc.subject: clinical research
dc.subject: electronic health records
dc.subject: epidemiology
dc.subject: machine-learning
dc.subject: sampling design
dc.subject: Biostatistics
dc.subject.other: Biostatistics
dc.title: Sampling designs for resource efficient collection of outcome labels for machine-learning, with application to electronic medical records
dc.type: Thesis

Files

Original bundle

Name: Tan_washington_0250E_19428.pdf
Size: 8.06 MB
Format: Adobe Portable Document Format
