Sampling designs for resource efficient collection of outcome labels for machine-learning, with application to electronic medical records
| dc.contributor.advisor | Heagerty, Patrick J | |
| dc.contributor.author | Tan, Wei Ling Katherine | |
| dc.date.accessioned | 2019-02-22T17:03:11Z | |
| dc.date.issued | 2019-02-22 | |
| dc.date.submitted | 2018 | |
| dc.description | Thesis (Ph.D.)--University of Washington, 2018 | |
| dc.description.abstract | In leveraging data from large-scale electronic medical record systems for research, an important step is the accurate identification of key clinical outcomes. Some outcomes must be derived or predicted from both structured and unstructured data, for example using statistical machine-learning classification. Classification requires the collection of labeled data, which is a sample where actual outcome statuses are manually coded by human clinical experts. For rare outcomes, simple random sampling (SRS) for labeled data collection results in very few cases in the sample. Such outcome class imbalance results in insufficient information for classifier modeling, yet additional abstraction is often expensive and time-consuming. In this dissertation, we propose sampling designs for labeled data collection towards machine-learning, targeting the rare outcome scenario. Our proposed designs are resource efficient, requiring a smaller sample size for modeling goals compared to SRS, yet design impacts on model development and validation can be statistically characterized to be "valid". We first introduce a stratified sampling procedure based on values of enrichment surrogates, which are summaries of structured data related to the clinical outcome requiring abstraction. Next, motivated by radiology reports with multiple co-occurring findings, we discuss extensions to the multi-label setting. Finally, for scenarios where a previously developed "source" model is to be externally transferred, we propose a framework for such "new" labeled data collection. | |
| dc.embargo.lift | 2021-02-11T17:03:11Z | |
| dc.embargo.terms | Restrict to UW for 2 years -- then make Open Access | |
| dc.format.mimetype | application/pdf | |
| dc.identifier.other | Tan_washington_0250E_19428.pdf | |
| dc.identifier.uri | http://hdl.handle.net/1773/43317 | |
| dc.language.iso | en_US | |
| dc.rights | none | |
| dc.subject | clinical research | |
| dc.subject | electronic health records | |
| dc.subject | epidemiology | |
| dc.subject | machine-learning | |
| dc.subject | sampling design | |
| dc.subject | Biostatistics | |
| dc.subject.other | Biostatistics | |
| dc.title | Sampling designs for resource efficient collection of outcome labels for machine-learning, with application to electronic medical records | |
| dc.type | Thesis |
Files
Original bundle
- Name: Tan_washington_0250E_19428.pdf
- Size: 8.06 MB
- Format: Adobe Portable Document Format
