Measurement Error in Microbiome Sequencing Experiments: Statistical and Scientific Considerations
Authors
Clausen, David
Abstract
Next-generation sequencing (NGS) methods have become an essential tool in the
study of complex microbial communities known
as microbiomes. Because of their near ubiquity, such
communities have been the focus of substantial
research aiming to elucidate their structure and
yield new insights into public health, medicine, and agriculture,
among other fields.
However, the relationship between the true composition of biological samples
on which sequencing is performed and sequencing output
is complex and only partially understood. As a result,
it is often unclear to what extent experimental
results in microbiome science reflect underlying biology
rather than technical artifacts of a complex measurement
process. To address this uncertainty, we analyze a large NGS dataset
generated by a multi-laboratory study of measurement error in microbiome
sequencing data. We find, in replicate measurements on identical biological
specimens, that distinctions between specimens apparent in measurements
taken by one laboratory are not reliably resolved in measurements taken by
others, with the degree of discordance varying with the
taxonomic level and scale at which distinctions are made. This
finding suggests that comparisons across groups in microbiome
studies may not dependably reflect biology. We next present a statistical model
appropriate for NGS data
subject both to detection effects -- multiplicative
over- and under-detection of microbial taxa relative to their
true abundances -- and to
potential contamination by taxa not present in specimens of interest.
Our model uses experimental
covariates and
measurements on communities of known
composition (also called positive controls)
to estimate community composition in specimens of
interest as well as detection effects and the form
and intensity of contamination. We show via analysis
of real datasets as well as through simulation that
this model substantially outperforms standard
estimators of microbial relative abundance in data subject to
detection effects and contamination. In particular, we demonstrate
that our model can exploit the structure of dilution series
experiments to accurately identify contamination, even
in the absence of positive control measurements. However,
the same is not true for detection effects, which in
general can only be estimated among microbial taxa
present in communities of known composition. To address this limitation, we develop
a log-linear model to estimate means of
outcomes observed up to unknown sample-specific
scalings and subject to detection effects. Our motivating example is the
estimation, from NGS data, of differences in log mean
microbial cell concentrations across covariates of interest.
The presence of unknown scalings
renders our estimand only partially identifiable.
We address this by imposing simple constraints,
which may be modified to suit differing scientific
contexts. We validate this model via simulations
and illustrate its use with a whole-genome-sequencing
dataset collated from multiple studies of associations between
colorectal cancer and the human gut microbiome. Taken together, this work identifies measurement error as a
key consideration in the design, analysis, and interpretation of
microbiome sequencing experiments, and in addition provides
novel statistical methods to
characterize and account for this error.
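As a reading aid, the two error sources named in the abstract can be illustrated with a toy calculation. This is a minimal sketch and not the thesis's model: the function name `expected_read_proportions`, the taxon labels, and all numeric values are hypothetical, and it assumes only that detection effects scale each taxon's true abundance multiplicatively while contamination adds signal for taxa that may be absent from the specimen.

```python
def expected_read_proportions(true_abundance, efficiency, contamination=None):
    """Toy illustration (hypothetical, not the thesis's estimator).

    true_abundance: dict taxon -> true relative abundance in the specimen
    efficiency:     dict taxon -> multiplicative detection effect
    contamination:  dict taxon -> additive contaminant signal (optional)
    Returns the expected proportion of reads assigned to each taxon.
    """
    contamination = contamination or {}
    taxa = sorted(set(true_abundance) | set(contamination))
    # Expected signal: true abundance scaled by detection effect, plus contamination.
    signal = {
        t: true_abundance.get(t, 0.0) * efficiency.get(t, 1.0)
        + contamination.get(t, 0.0)
        for t in taxa
    }
    total = sum(signal.values())
    return {t: s / total for t, s in signal.items()}


# Two taxa at equal true abundance; taxon A is over-detected fourfold.
truth = {"A": 0.5, "B": 0.5}
eff = {"A": 4.0, "B": 1.0}

no_contam = expected_read_proportions(truth, eff)

# Same specimen, with a contaminant taxon C that is absent from the truth.
with_contam = expected_read_proportions(truth, eff, {"C": 0.5})
```

In this toy setting, the naive estimate (observed read proportions) would report taxon A as four times as abundant as taxon B even though the two are equally abundant, and it would assign a nonzero proportion to the contaminant C.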
Description
Thesis (Ph.D.)--University of Washington, 2022
