Explainable query generation for cohort discovery and biomedical reasoning using natural language
Authors
Dobbins, Nicholas J
Abstract
Clinical trials serve a critical role in the generation of medical evidence and enabling biomedical research. In order to identify potential participants, investigators publish eligibility
criteria, such as past history of certain conditions, treatments, or laboratory tests.
Patients meeting a trial’s eligibility criteria are considered potential candidates for recruitment.
Recruitment of participants remains, however, a major barrier to successful trial
completion, and manual chart review of hundreds or thousands of patients to determine a
candidate pool can be prohibitively labor- and time-intensive. At the same time, the amount and variety of data contained in Electronic Health Records (EHRs) are increasing dramatically, creating both challenges and opportunities for patient
recruitment. While more granular and potentially useful data are captured and stored in
EHRs now than in the past, the process of accessing and leveraging these data often requires
technical expertise and extensive knowledge of biomedical terminologies and data models.
This thesis focuses on the development of an integrated system for identifying patients in
clinical databases using a natural language interface. Humans use natural language nearly
effortlessly, and thus automated means of leveraging natural language to identify patients
in databases hold great potential in time and cost savings. The primary contributions of
this work include a novel database schema annotation and mapping method enabling data
model agnostic query generation, a method for generating intermediate logical representations
of eligibility criteria, exploration of dynamic reasoning upon non-specific criteria, and
development of an integrated graph-based knowledge base of biomedical concepts.
This work also introduces two new annotated corpora, the Leaf Clinical Trials (LCT)
corpus and Leaf Logical Forms (LLF) corpus. The LCT corpus is unique in the granularity
with which it represents complex eligibility criteria, while the LLF corpus is the most extensive
annotated corpus of eligibility criteria logical representations at the time of this writing.
Both corpora are valuable contributions to the biomedical informatics and natural language
processing communities. To evaluate the viability of our methods, both our system and a human database programmer generated queries to identify patients eligible for 8 past clinical trials at our institution.
We then compared actual participant enrollments to those found eligible. We demonstrate
that our system rivals and sometimes surpasses an experienced human programmer in finding
eligible patients. Finally, we developed a novel user interface for enabling real-time interactive
cohort discovery.
Description
Thesis (Ph.D.)--University of Washington, 2023
