Explainable query generation for cohort discovery and biomedical reasoning using natural language

Dobbins, Nicholas J

Files

Dobbins_washington_0250E_26010.pdf (5.18 MB)

Date

2023-09-27

Abstract

Clinical trials serve a critical role in generating medical evidence and enabling biomedical research. In order to identify potential participants, investigators publish eligibility criteria, such as past history of certain conditions, treatments, or laboratory tests. Patients meeting a trial’s eligibility criteria are considered potential candidates for recruitment. Recruitment of participants remains, however, a major barrier to successful trial completion, and manual chart review of hundreds or thousands of patients to determine a candidate pool can be prohibitively labor- and time-intensive. At the same time, the amount and variety of data contained in Electronic Health Records (EHRs) are increasing dramatically, creating both challenges and opportunities for patient recruitment. While more granular and potentially useful data are captured and stored in EHRs now than in the past, the process of accessing and leveraging these data often requires technical expertise and extensive knowledge of biomedical terminologies and data models. This thesis focuses on the development of an integrated system for identifying patients in clinical databases using a natural language interface. Humans use natural language nearly effortlessly, and thus automated means of leveraging natural language to identify patients in databases hold great potential for time and cost savings. The primary contributions of this work include a novel database schema annotation and mapping method enabling data-model-agnostic query generation, a method for generating intermediate logical representations of eligibility criteria, exploration of dynamic reasoning upon non-specific criteria, and development of an integrated graph-based knowledge base of biomedical concepts. This work also introduces two new annotated corpora, the Leaf Clinical Trials (LCT) corpus and the Leaf Logical Forms (LLF) corpus.
The LCT corpus is unique in the granularity with which it represents complex eligibility criteria, while the LLF corpus is the most extensive annotated corpus of eligibility criteria logical representations at the time of this writing. Both corpora are valuable contributions to the biomedical informatics and natural language processing communities. To evaluate the viability of our methods, both our system and a human database programmer generated queries to identify patients eligible for 8 past clinical trials at our institution. We then compared actual participant enrollments to those found eligible. We demonstrate that our system rivals and sometimes surpasses an experienced human programmer in finding eligible patients. Finally, we developed a novel user interface for enabling real-time interactive cohort discovery.