Statistical Hurdle Models for Single Cell Gene Expression: Differential Expression and Graphical Modeling
This dissertation describes a set of statistical methods developed for the analysis of single-cell gene expression. A characteristic of single-cell expression is bimodality, in which two clusters of expression are present. For any given transcript, the null cluster corresponds to cells without detectable expression (so any non-zero measurement in that cluster reflects measurement error), while the signal cluster contains cells with a positive, detectable level of expression. Statistical models that accommodate this characteristic are considered.
• In Chapter 1, the motivation and history of single-cell gene expression are considered. Scientific and statistical questions addressable through single-cell expression are discussed, and some statistical frameworks for bulk and single-cell expression are described.
• In Chapter 2, I consider data generated from replicates of single cells and 100-cell aggregates that were assayed through single-cell reverse-transcriptase qPCR (rt-qPCR). In rt-qPCR the null cluster manifests as bona fide zeros, so expression is characterized by zero-inflation of otherwise continuous values. The average expression from single cells and 100-cell replicates is compared to develop quality control metrics that optimize the concordance between single-cell and 100-cell measurements. A Hurdle model is proposed, which accounts for the fact that a gene at the single-cell level is either on (and a continuous expression measure is recorded) or off (and the recorded expression is zero). Based on this model, I derive a combined likelihood-ratio test for differential expression that incorporates both the discrete and continuous components. This chapter was originally published in McDavid et al.
• In Chapter 3, I consider application of the Hurdle model to single-cell RNA sequencing (scRNAseq). In these technologies, the binary zero-inflation found in rt-qPCR-based assays manifests itself as continuous, bimodal expression, motivating a clustering and thresholding procedure to assign each measurement to a cluster. The Hurdle model, extended and cast as a vector generalized linear model (vGLM), is provided as an R package named MAST. The cellular detection rate (CDR) is defined as the number of expressed genes found in a cell. It is identified as an important latent factor in single-cell experiments, and is argued to measure variations in size and efficiency among cells. Gene set enrichment analysis using the Hurdle model and the use of residuals defined through such models are discussed. Parts of this chapter were originally published in Finak et al. and McDavid et al.
• In Chapter 4, the Hurdle model is generalized to model multivariate dependences between cells, permitting the parametrization of graphical models. A neighborhood selection-based method is proposed that leverages group-ℓ1 penalized regression. Networks estimated on single-cell and multi-cell experiments are contrasted and found to differ markedly. In order to synthesize graphs estimated on transcriptome-scale data, a test for enrichment of connections between and within gene ontology categories is proposed.
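As a rough illustration of the combined likelihood-ratio test mentioned for Chapter 2, the Python sketch below compares one gene between two groups of cells. It is not the dissertation's or MAST's implementation; it assumes a two-group comparison with no covariates, a Bernoulli model for detection, and a Gaussian model with common variance for the positive (log-scale) values. The function names `hurdle_lrt` and `cellular_detection_rate` are hypothetical.

```python
import math

def binom_loglik(k, n):
    """Maximized Bernoulli log-likelihood for k detections among n cells."""
    if k == 0 or k == n:
        return 0.0
    p = k / n
    return k * math.log(p) + (n - k) * math.log(1 - p)

def rss(values):
    """Residual sum of squares around the sample mean."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)

def hurdle_lrt(a, b):
    """Combined LRT for one gene; zeros mean 'not detected' (assumption).

    Returns (statistic, degrees_of_freedom, p_value).
    """
    pos_a = [v for v in a if v > 0]
    pos_b = [v for v in b if v > 0]

    # Discrete component: does the detection rate differ between groups?
    disc = 2 * (binom_loglik(len(pos_a), len(a))
                + binom_loglik(len(pos_b), len(b))
                - binom_loglik(len(pos_a) + len(pos_b), len(a) + len(b)))
    stat, df = max(disc, 0.0), 1

    # Continuous component: does the mean of positive values differ?
    # Only added when both groups have positive values and the
    # within-group residual variation is non-zero.
    if pos_a and pos_b:
        rss_alt = rss(pos_a) + rss(pos_b)
        if rss_alt > 0:
            n_pos = len(pos_a) + len(pos_b)
            stat += n_pos * math.log(rss(pos_a + pos_b) / rss_alt)
            df = 2

    # Chi-square survival function in closed form for 1 or 2 df.
    p = math.exp(-stat / 2) if df == 2 else math.erfc(math.sqrt(stat / 2))
    return stat, df, p

def cellular_detection_rate(cell):
    """CDR as defined in the abstract: number of expressed genes in a cell."""
    return sum(1 for v in cell if v > 0)
```

Summing the two components relies on the discrete and continuous parts of the likelihood factorizing, so under the null hypothesis the sum is referred to a chi-square distribution whose degrees of freedom are the sum of those of the two parts.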
- Statistics