Scalable statistical methods for microbial metagenomics

Teichman, Sarah

dc.contributor.advisor	Willis, Amy
dc.contributor.author	Teichman, Sarah
dc.date.accessioned	2025-01-23T20:13:55Z
dc.date.available	2025-01-23T20:13:55Z
dc.date.issued	2025-01-23
dc.date.submitted	2024
dc.description	Thesis (Ph.D.)--University of Washington, 2024
dc.description.abstract	Scientific interest in microbiomes (communities of microscopic organisms in a given environment) has recently expanded due to the growing understanding of the role of the microbiome in human and environmental health, and in conjunction with the decreasing costs of metagenomic sequencing. However, there are several complications of the data that we observe from sequencing microbial samples that preclude the use of off-the-shelf statistical methods. Therefore, there is a high demand for statistical methods that are tailored to address scientific questions about microbiomes while accounting for relevant features of how the data are collected and processed. These methods must also be feasible and computationally efficient for the large scale of data that metagenomic sequencing produces. In my first project, I present a visualization method to compare estimated gene-level evolutionary histories to estimated genome-level evolutionary histories. Evolutionary histories are best represented by phylogenetic trees, which are complex graph objects made up of nodes that represent biological categories, referred to as taxa, and edges that represent the evolutionary relationships between taxa. I use a local linear approximation of phylogenetic tree space to visualize estimated gene trees as points in a low-dimensional Euclidean space. I demonstrate the utility of my proposed visualization approach through two microbial data analyses. This visualization approach is scalable for large sets of gene trees that encode a large number of taxa. Next, I present another computationally scalable method for the analysis of metagenomic sequencing data. I extend the method of Clausen and Willis for taxonomic differential abundance analysis in order to make it computationally efficient for datasets with thousands of taxa.
Through simulation, I demonstrate that my scalable method achieves similar Type I error rate control and power to the original method, and through data analyses I demonstrate that the two methods lead to very similar differential abundance conclusions. The differential abundance estimand in my method is defined with respect to a small set of reference taxa, and I suggest several approaches to choosing such a set and investigate how these approaches affect estimates and inference results through simulation and in a small data analysis. In my third project, I consider differential abundance analyses of molecular functions. I propose a novel functional abundance model, and show that in this model, the identifiable differential abundance parameter is a function of both biological parameters and unknown sequencing effects. I develop a framework to simulate data under my functional abundance model, and use this framework to study how different magnitudes of sequencing effects affect estimation and inference of these differential abundance parameters, relative to the true biological fold-differences in abundance that are scientifically relevant. In these simulations, I find that inference on the identifiable differential abundance parameter cannot reliably be used to draw conclusions about biological fold-differences in abundance, especially in the presence of sequencing effects with large magnitudes. To address this, I suggest careful interpretation of results from the differential abundance analysis of functional data in terms of a parameter that combines biological signal with sequencing artifacts. As a whole, this dissertation presents three methods that address complex scientific questions with applications to microbiome science, each of which accounts for the effects of sequencing on microbiome data and is computationally efficient for the large scale of a typical metagenomic dataset.
dc.embargo.terms	Open Access
dc.format.mimetype	application/pdf
dc.identifier.other	Teichman_washington_0250E_27686.pdf
dc.identifier.uri	https://hdl.handle.net/1773/52874
dc.language.iso	en_US
dc.rights	none
dc.subject	Statistics
dc.subject.other	Statistics
dc.title	Scalable statistical methods for microbial metagenomics
dc.type	Thesis

Files

Original bundle

Name: Teichman_washington_0250E_27686.pdf
Size: 12.26 MB
Format: Adobe Portable Document Format

Collections

Statistics