Use of the Electronic Health Records to facilitate phenotyping, comorbidity analysis, and genomics

Xian, Su

dc.contributor.advisor	Tarczy-Hornoch, Peter
dc.contributor.author	Xian, Su
dc.date.accessioned	2025-01-23T20:03:19Z
dc.date.available	2025-01-23T20:03:19Z
dc.date.issued	2025-01-23
dc.date.submitted	2024
dc.description	Thesis (Ph.D.)--University of Washington, 2024
dc.description.abstract	Since the wide adoption of electronic health records (EHR) in 2010, many topics regarding the secondary use of the EHR have received attention. Secondary use of the EHR usually means repurposing EHR data for research, including information extraction, phenotyping, disease surveillance and forecasting, and policy making. Within this context, we ask how EHR data can be used to study a disease of interest, especially to identify new knowledge. In this work, we explored the secondary use of the EHR with both unsupervised and supervised methods, examining the potential of EHR data to identify novel disease patterns and investigate disease etiology. In Aim 1, we present an unsupervised approach for embedding high-dimensional EHR data at the patient level to help characterize patients and identify new disease patterns. Inspired by the transformer, a modern language model architecture built on the attention mechanism, we use patient diagnosis and procedure codes as the vocabulary and treat each patient as a sentence to perform the patient embedding. Using 34,851 medical codes for 1,046,649 longitudinal patient events, we performed embedding for 102,739 patients in the electronic MEdical Records and GEnomics (eMERGE) Network. In Aim 2, we illustrated several downstream applications of the patient embedding, in particular providing insights into comorbidity patterns and the progression trajectories of individual patients within certain diseases of interest. We demonstrated excellent performance in predicting future disease events (median AUROC = 0.87, within one year in the future) and in bulk phenotyping (median AUROC = 0.84). More importantly, we illustrated the use of patient vectors to reveal heterogeneous comorbidity patterns (disease subtypes) within a defined phenotype and to capture their disease trajectories longitudinally.
Our model was externally validated using the EHR dataset from the University of Washington, showing robust and stable performance. These results pave the way for using representation learning in the EHR to characterize patients with certain diseases of interest and their associated clinical outcomes, which can improve disease forecasting performance and facilitate personalized medicine. In Aim 3, we used an EHR-derived, validated, rule-based phenotyping algorithm to establish a cohort for identifying genetic risk factors for depression. We illustrated how this EHR-derived algorithm can support genomic studies of disease etiology. We took depression, a complex psychiatric disease and a leading cause of disability, as an example to study genetic predisposition using data from the EHR. Large-scale genomic studies have identified common variants associated with depression. However, because the depression phenotype is complex, such studies have suffered from inconsistent cohort definitions and limited sample sizes. There is a need for a validated, automated EHR phenotyping algorithm that can accurately identify depression in the clinic. Here, we implemented a validated EHR phenotyping algorithm to construct a depression cohort (11,532 cases and 39,631 controls, total n = 51,163) and conducted a genome-wide association study (GWAS) using this cohort. Our study reproduced previously identified genetic associations (PHF5A, KCNG2) with depression susceptibility. We also identified novel SNPs falling into the HLA region and the IGVH region, indicating an association between immune function and the depression phenotype. In addition, we demonstrated the robustness of our phenotyping algorithm through genetic correlation analysis, using a large meta-analysis of major depressive disorder as a standard.
Together, this work serves as a non-exhaustive but powerful demonstration of the use of EHR data, in both supervised and unsupervised settings, to facilitate many downstream clinical applications, including phenotyping, comorbidity analysis, and genomics.
dc.embargo.terms	Open Access
dc.format.mimetype	application/pdf
dc.identifier.other	Xian_washington_0250E_27746.pdf
dc.identifier.uri	https://hdl.handle.net/1773/52687
dc.language.iso	en_US
dc.rights	CC BY
dc.subject	Artificial intelligence
dc.subject	Digital medicine
dc.subject	Electronic Health Records
dc.subject	Genetics
dc.subject	Language model
dc.subject	Medicine
dc.subject	Health sciences
dc.subject.other	To Be Assigned
dc.title	Use of the Electronic Health Records to facilitate phenotyping, comorbidity analysis, and genomics
dc.type	Thesis

Files

Original bundle

Name: Xian_washington_0250E_27746.pdf
Size: 8.89 MB
Format: Adobe Portable Document Format

Collections

Biomedical and health informatics