Measurement Error in Microbiome Sequencing Experiments: Statistical and Scientific Considerations

Clausen, David

Files

Clausen_washington_0250E_24904.pdf (3.53 MB)

Date

2023-01-21

Abstract

Next-generation sequencing (NGS) methods have become an essential tool in the study of complex microbial communities known as microbiomes. Because of their near ubiquity, such communities have been the focus of substantial research aiming to elucidate their structure and yield new insights into public health, medicine, and agriculture, among other fields. However, the relationship between the true composition of biological samples on which sequencing is performed and sequencing output is complex and only partially understood. As a result, it is often unclear to what extent experimental results in microbiome science reflect underlying biology rather than technical artifacts of a complex measurement process. To address this uncertainty, we analyze a large NGS dataset generated by a multi-laboratory study of measurement error in microbiome sequencing data. We find, in replicate measurements on identical biological specimens, that distinctions between specimens apparent in measurements taken by one laboratory are not reliably resolved in measurements taken by others, with the degree of discordance varying with the taxonomic level and scale at which distinctions are made. Hence, our findings suggest that comparisons across groups in microbiome studies may not dependably reflect biology. We next present a statistical model appropriate for NGS data subject both to detection effects -- multiplicative over- and under-detection of microbial taxa relative to their true abundances -- and to potential contamination by taxa not present in specimens of interest. Our model uses experimental covariates and measurements on communities of known composition (also called positive controls) to estimate community composition in specimens of interest as well as detection effects and the form and intensity of contamination.
We show via analysis of real datasets as well as through simulation that this model substantially outperforms standard estimators of microbial relative abundance in data subject to detection effects and contamination. In particular, we demonstrate that our model can exploit the structure of dilution series experiments to accurately identify contamination, even in the absence of positive control measurements. However, the same is not true for detection effects, which in general can only be estimated among microbial taxa present in communities of known composition. To address this limitation, we develop a log-linear model to estimate means of outcomes observed up to unknown sample-specific scalings and subject to detection effects, taking as our motivating example the estimation, from NGS data, of differences in log mean microbial cell concentrations across covariates of interest. The presence of unknown scalings renders our estimand only partially identifiable. We address this by imposing simple constraints, which may be modified to suit differing scientific contexts. We validate this model via simulations and illustrate its use with a whole-genome-sequencing dataset collated from multiple studies of associations between colorectal cancer and the human gut microbiome. Taken together, this work identifies measurement error as a key consideration in the design, analysis, and interpretation of microbiome sequencing experiments, and in addition provides novel statistical methods to characterize and account for this error.