Kernel Methods for Data Integration in Microbiome-Omics Studies

Ma, Nanxun

Files

Ma_washington_0250E_23821.pdf (829.25 KB)

Date

2022-04-19

Abstract

The human microbiome plays an important role in maintaining the external and internal environment that supports human health, and it is associated with many health conditions and diseases. In microbiome studies, other types of omics data are often collected at the same time. However, integrating microbiome data with other types of omics data remains an open scientific objective because of several challenges, including high dimensionality, compositional structure, nonlinear effects, and missing data. To address these challenges, we develop novel statistical approaches for integrative analysis using kernel methods, with particular emphasis on integrating microbiome data with other types of omics data. Kernel methods are popular for high-dimensional data because they can accommodate nonlinear effects, and they have been tailored to capture important data-type-specific effects. Within this context, we use kernel approaches to improve understanding of the relationships between data types and to improve the analysis of each data type in relation to the others.
In the first part of this dissertation, we propose a sparse kernel RV (KRV) coefficient to facilitate the identification of genomic features associated with overall microbiome composition (beta-diversity). The KRV is a generalized measure of multivariate correlation between two data sets, in this case microbiome and genomic data, that are embedded as kernel matrices. For microbiome data, we construct fixed, ecologically relevant kernels that incorporate important ecological structure. For genomic data, we construct kernels that include feature-specific weights. Sparse estimation of the weights enables selection of genomic markers.
For classification problems, how to integrate microbiome data with metabolite data remains an open question when both data types are available with labels for training but only metabolite data are available for prediction. In the second part of the dissertation, we develop classification models that use multiple data types and can be applied to future data sets in which only one type of data is collected. To this end, we introduce kernel structures into discriminant analysis and develop kernel linear discriminant analysis (KLDA), which can improve prediction accuracy by leveraging data that are only partially available. The general KLDA can handle the high dimensionality of the microbiome data but not that of the other omics data. We therefore propose a penalized version of KLDA that can incorporate different penalty terms, for example L1 or L2 penalties, as required by the type of omics data, so that it can serve as a classification method when both data sets are high-dimensional. We evaluate the performance of these methods through extensive simulation studies and apply them to studies investigating the association of microbiome data with inflammatory bowel disease and with women's menopause strategies.
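
As a rough illustration of the KRV coefficient mentioned in the first part of the abstract, the sketch below computes the standard kernel RV statistic, tr(K~ L~) / sqrt(tr(K~ K~) tr(L~ L~)), where K~ and L~ are the double-centered kernel matrices. It is not the dissertation's sparse estimator: the weights here are fixed by hand rather than estimated, and the function names (center_kernel, krv, weighted_linear_kernel), the weighted linear form of the genomic kernel, and the simulated data are all assumptions made for the example. A plain linear kernel stands in for an ecologically relevant microbiome kernel.

```python
import numpy as np


def center_kernel(K):
    """Double-center a kernel matrix: H K H with H = I - (1/n) 11^T."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return H @ K @ H


def krv(K, L):
    """Kernel RV coefficient between two n x n kernel matrices."""
    Kc = center_kernel(K)
    Lc = center_kernel(L)
    numerator = np.trace(Kc @ Lc)
    denominator = np.sqrt(np.trace(Kc @ Kc) * np.trace(Lc @ Lc))
    return numerator / denominator


def weighted_linear_kernel(X, w):
    """Hypothetical feature-weighted linear kernel: X diag(w) X^T."""
    return (X * w) @ X.T


# Toy data (assumption): 30 samples, 5 "genomic" features. The stand-in
# "microbiome" profile depends only on the first two genomic features.
rng = np.random.default_rng(0)
n = 30
G = rng.normal(size=(n, 5))
M = G[:, :2] + 0.1 * rng.normal(size=(n, 2))

# Stand-in microbiome kernel (a real analysis would use an ecological kernel).
K_micro = M @ M.T

# Weights that keep only the informative features vs. only the noise features.
w_signal = np.array([1.0, 1.0, 0.0, 0.0, 0.0])
w_noise = np.array([0.0, 0.0, 1.0, 1.0, 1.0])

krv_signal = krv(K_micro, weighted_linear_kernel(G, w_signal))
krv_noise = krv(K_micro, weighted_linear_kernel(G, w_noise))

print("KRV with informative features weighted:", krv_signal)
print("KRV with noise features weighted:", krv_noise)
```

In this toy setting, putting the weight on the informative features gives a larger KRV than putting it on the noise features, which is the intuition behind selecting genomic markers through sparse weights.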