Statistical Methods for Sparse Binary, Count Data and Treatment Effect Heterogeneity

Xie, Yuxiang

Files

Xie_washington_0250E_20176.pdf (1.94 MB)

Date

2019-08-14

Author

Xie, Yuxiang

Abstract

The concept of "sparsity" arises in many areas of statistics, and it is a double-edged sword whose effect depends on the statistical context. Sometimes sparsity brings convenience: a sparse statistical model has only a small number of nonzero parameters and is therefore easier to interpret than a dense model. In other settings sparsity causes trouble: a sparse sequencing read count table contains excessive zeros because many rare bacterial taxa are not captured in the sequencing reads, and this sparsity may lead to inaccurate estimates of bacterial abundances. This dissertation develops statistical methodologies for dealing with sparsity problems in three different statistical topics. We first present a false discovery rate (FDR) controlled variable selection method for a sparse model with binary covariates. We show that our proposal controls the FDR under a pre-specified level in finite samples and achieves asymptotic power equal to one under some mild assumptions. Next, we consider a sparse generalized linear model for studying treatment effect heterogeneity, and we propose two statistical frameworks that detect factors contributing to heterogeneous treatment effects while simultaneously controlling the FDR. Finally, we develop a statistical method based on non-negative matrix factorization (NMF) for estimating bacterial compositions from sparse count data in microbiome studies. We establish upper bounds on the estimation error of our NMF estimators and show in simulation studies that our proposal outperforms some existing methods in various settings. We also demonstrate the interpretability of our model in a real data application.