Leveraging machine learning & interpretability methods for limited data: from systems biology to healthcare applications

Beebe-Wang, Nicasia

Files

BeebeWang_washington_0250E_25789.pdf (12.29 MB)

Date

2023-08-14

Author

Beebe-Wang, Nicasia

Abstract

Although advances in machine learning (ML) have led to human-level performance in many domains, including some healthcare settings, high-performing models often rely on access to large datasets with densely observed features and labels. In practice, however, many biomedical applications involve some form of scarcity or sparsity that limits our ability to use standard ML models trained on big data. In particular, we focus on common limitations in computational biology and health settings, including limited sample size, sparsely labeled data, and missing features or limited capacity to collect features at prediction time. We present projects with applications ranging from computational molecular biology to healthcare, and for each we describe our strategies for leveraging machine learning and explainability methods to enable prediction and interpretation of complex biomedical systems in the face of real-world data limitations. We first describe a unified framework that we developed to uncover relationships between gene expression and Alzheimer’s disease (AD) neuropathologies despite a limited sample size and sparsely available labels. By using multi-task deep learning in a multi-cohort setting and applying interpretability methods, we were able to identify nuanced sex-specific relationships between genes and AD. Next, we describe an automatic integrative method for learning interpretable communities of biological pathways, which we developed to aid researchers in interpreting the outcomes of computational biology analysis pipelines. Finally, we turn to clinical risk prediction applications, in which medical practitioners often have limited time to collect features and assess a patient’s risk for various health-related outcomes. We first describe a project in which we used feature attributions from a dementia prediction model to identify a globally relevant subset of features. We then demonstrated that a risk prediction model trained on these features achieved performance similar to that of a standard neuropsychological battery, but in one-fifth of the time. However, this standard approach of retraining a model on a predetermined set of selected features may not be ideal when different features are already available at prediction time, or when the individual is missing these key features. Thus, we finally propose a method for clinical risk prediction that simultaneously generates risk prediction intervals given sparse existing information and dynamically suggests useful features to collect next given a patient’s current context.