Extracting Clinical Information from Unstructured EHRs using Language Models, and its Role in Disease Prediction
Abstract
Clinical unstructured data contain critical information for clinical decision making, such as symptoms and radiology findings, that can complement structured EHRs and often add
greater detail. However, clinically relevant information can be buried in voluminous unstructured
EHR notes, which are challenging for physicians to review. In addition, large volumes of
text can include information irrelevant to secondary machine learning applications. We aim
to develop language model–based information extraction (IE) methods to extract clinically
critical information from EHR texts, supporting human review and secondary clinical decision
applications. We first develop robust event extraction methods using supervised learning to
identify clinical events at the sentence level, and improve their domain generalization across
different domain shifts. In one study on symptom event extraction, we demonstrate that
two strategies, adaptive pretraining on unstructured EHRs and masking frequent symptoms
during training, improve domain generalization when using an encoder-only language model.
In a second study on radiological findings extraction, we show that generative LMs generalize
better than encoder-only models in categorizing minority classes, and further training them
on decomposed, simpler subtasks improves generalization to complex tasks when subtask
dependencies are shifted across domains. In addition to event extraction from isolated reports,
we present longitudinal summarization of radiology reports as an additional IE task to track
radiological findings and capture temporal changes not reflected in individual reports. We
frame longitudinal summarization as a timeline generation task that groups related findings
across time, introduce RadTimeline as an evaluation dataset, and propose an LLM-based
approach that achieves good recall of lung findings and human-comparable grouping of
gold-standard findings without training data. Finally, we apply information extraction
to extract risk factors from longitudinal EHRs for a secondary-use clinical application,
a lung cancer prediction task. We create a lung cancer case-control cohort, where each
patient has a 5-year longitudinal EHR history and a lung cancer outcome within three
years. We find that COPD, smoking status, and radiology abnormality information extracted
from unstructured notes can complement structured EHRs and improve lung cancer
risk prediction performance. Using a transformer-based risk prediction model, we further
compare different representations of longitudinal risk factors across model variants and input
orderings, finding no benefit from including report findings beyond a
6-month window.
Description
Thesis (Ph.D.)--University of Washington, 2026
