Statistical Methods for the Analysis of Microbiome Data

Plantinga, Anna Marie

Files

Plantinga_washington_0250E_19032.pdf (2.17 MB)

Date

2018-11-28

Abstract

The human microbiome plays a vital role in maintaining health, and imbalances in the microbiome are associated with a wide variety of diseases. Understanding whether and how the microbiome is associated with particular health conditions is a focus of many modern microbiome studies, with the hope that a deeper understanding of these associations may lead to more effective prevention and treatment regimens. However, how best to analyze data from microbiome profiling studies remains unclear. The high dimensionality, compositional nature, intrinsic biological structure, and limited availability of samples pose substantial statistical challenges. To face these challenges, we propose novel analytic approaches based on sparse penalized regression strategies and distance-based global association analysis.

Most distance-based methods for global microbiome association analysis are restricted to simple dichotomous or quantitative outcomes, but more complex outcomes are increasingly common in microbiome studies. In the first part of this dissertation, we introduce two distance-based methods for the analysis of entire microbial communities in modern microbiome studies. We develop a kernel machine regression-based score test for association between the microbiome and censored time-to-event outcomes. We then propose a novel longitudinal measure of dissimilarity that summarizes changes in the microbiome across time and compares these changes between subjects. Since this dissimilarity may be incorporated into any distance-based analysis framework, it is a highly flexible tool for applying a wide variety of distance-based analyses in longitudinal studies.

Identification of associated taxa and detection of predictive microbial signatures are key to translation of microbiome studies. In the second part of this dissertation, we present two penalized regression methods for estimation and prediction with high-dimensional compositional data. Because phylogenetic similarity between bacteria often corresponds to shared functions, our first contribution is to incorporate phylogenetic structure into a penalized regression model for constrained data. We then propose a model that exploits phylogenetic structure to use partial information in the setting of differing feature sets between model-building and prediction datasets.

We evaluate the performance of these methods through extensive simulation studies and apply them to studies investigating the association of graft-versus-host disease or body mass index with the gut microbiome.