Scalable statistical methods for microbial metagenomics
| dc.contributor.advisor | Willis, Amy | |
| dc.contributor.author | Teichman, Sarah | |
| dc.date.accessioned | 2025-01-23T20:13:55Z | |
| dc.date.available | 2025-01-23T20:13:55Z | |
| dc.date.issued | 2025-01-23 | |
| dc.date.submitted | 2024 | |
| dc.description | Thesis (Ph.D.)--University of Washington, 2024 | |
| dc.description.abstract | Scientific interest in microbiomes (communities of microscopic organisms in a given environment) has recently expanded due to the growing understanding of the role of the microbiome in human and environmental health, and in conjunction with the decreasing costs of metagenomic sequencing. However, there are several complications of the data that we observe from sequencing microbial samples that preclude the use of off-the-shelf statistical methods. Therefore, there is a high demand for statistical methods that are tailored to address scientific questions about microbiomes while accounting for relevant features of how the data are collected and processed. These methods must also be feasible and computationally efficient for the large scale of data that metagenomic sequencing produces. In my first project, I present a visualization method to compare estimated gene-level evolutionary histories to estimated genome-level evolutionary histories. Evolutionary histories are best represented by phylogenetic trees, which are complex graph objects made up of nodes that represent biological categories, referred to as taxa, and edges that represent the evolutionary relationships between taxa. I use a local linear approximation of phylogenetic tree space to visualize estimated gene trees as points in a low-dimensional Euclidean space. I demonstrate the utility of my proposed visualization approach through two microbial data analyses. This visualization approach is scalable for large sets of gene trees that encode a large number of taxa. Next, I present another computationally scalable method for the analysis of metagenomic sequencing data. I extend the method of Clausen and Willis for taxonomic differential abundance analysis in order to make it computationally efficient for datasets with thousands of taxa. 
Through simulation, I demonstrate that my scalable method achieves similar Type I error rate control and power to the original method, and through data analyses I demonstrate that the two methods lead to very similar differential abundance conclusions. The differential abundance estimand in my method is defined with respect to a small set of reference taxa, and I suggest several approaches to choosing such a set and investigate how these approaches affect estimates and inference results through simulation and in a small data analysis. In my third project, I consider differential abundance analyses of molecular functions. I propose a novel functional abundance model, and show that in this model, the identifiable differential abundance parameter is a function of both biological parameters and unknown sequencing effects. I develop a framework to simulate data under my functional abundance model, and use this framework to study how different magnitudes of sequencing effects affect estimation and inference of these differential abundance parameters, relative to the true biological fold-differences in abundance that are scientifically relevant. In these simulations, I find that inference on the identifiable differential abundance parameter cannot reliably be used to draw conclusions about biological fold-differences in abundance, especially in the presence of sequencing effects with large magnitudes. To address this, I suggest careful interpretation of results from the differential abundance analysis of functional data in terms of a parameter that combines biological signal with sequencing artifacts. As a whole this dissertation presents three methods that address complex scientific questions with applications to microbiome science, each of which accounts for the effects of sequencing on microbiome data and is computationally efficient for the large scale of a typical metagenomic dataset. | |
| dc.embargo.terms | Open Access | |
| dc.format.mimetype | application/pdf | |
| dc.identifier.other | Teichman_washington_0250E_27686.pdf | |
| dc.identifier.uri | https://hdl.handle.net/1773/52874 | |
| dc.language.iso | en_US | |
| dc.rights | none | |
| dc.subject | Statistics | |
| dc.subject.other | Statistics | |
| dc.title | Scalable statistical methods for microbial metagenomics | |
| dc.type | Thesis |
Files
Original bundle
- Name: Teichman_washington_0250E_27686.pdf
- Size: 12.26 MB
- Format: Adobe Portable Document Format
