Extracting Clinical Information from Unstructured EHRs using Language Models, and its Role in Disease Prediction
Abstract
Clinical unstructured data contain critical information for clinical decision making, such as symptoms and radiology findings, that can complement structured EHRs and often add
greater detail. However, clinically relevant information can be buried in voluminous unstructured
EHR notes, which are challenging for physicians to review. In addition, large volumes of
text can include information irrelevant to secondary machine learning applications. We aim
to develop language model–based information extraction (IE) methods to extract clinically
critical information from EHR texts, supporting human review and secondary clinical decision
applications. We first develop robust event extraction methods using supervised learning to
identify clinical events at the sentence level, and improve their domain generalization across
different domain shifts. In one study on symptom event extraction, we demonstrate that
two strategies, adaptive pretraining on unstructured EHRs and masking frequent symptoms
during training, improve domain generalization when using an encoder-only language model.
In a second study on radiological findings extraction, we show that generative LMs generalize
better than encoder-only models in categorizing minority classes, and further training them
on decomposed, simpler subtasks improves generalization to complex tasks when subtask
dependencies are shifted across domains. In addition to event extraction from isolated reports,
we present longitudinal summarization of radiology reports as an additional IE task to track
radiological findings and capture temporal changes not reflected in individual reports. We
frame longitudinal summarization as a timeline generation task that groups related findings
across time, introduce RadTimeline as an evaluation dataset, and propose an LLM-based
approach that achieves good recall of lung findings and human-comparable grouping of
gold-standard findings without training data. Finally, we apply information extraction
to extract risk factors from longitudinal EHRs for a secondary-use clinical application,
a lung cancer prediction task. We create a lung cancer case-control cohort, where each
patient has a 5-year longitudinal EHR history and a lung cancer outcome within three
years. We find that COPD, smoking status, and radiology abnormality information extracted
from unstructured notes can complement structured EHRs and improve lung cancer
risk prediction performance. Using a transformer-based risk prediction model, we further
compare different representations of longitudinal risk factors across model variants and input
orderings, finding no benefit from including report findings beyond a
6-month window.
Description
Thesis (Ph.D.)--University of Washington, 2026
