Hypothesis Testing With High-Dimensional Data
In the past two decades, vast high-dimensional datasets have become a mainstay of biomedical applications, from genomics to neuroscience. These high-dimensional data enable researchers to answer scientific questions that are impossible to answer with classical, low-dimensional datasets. However, due to the “curse of dimensionality”, such high-dimensional datasets also pose serious statistical challenges. Motivated by these emerging applications, statisticians have devoted much effort to developing estimation methods for high-dimensional linear models and graphical models. Progress on quantifying the uncertainty of the resulting estimates, e.g., obtaining p-values and confidence intervals, which are crucial for drawing scientific conclusions, has been more limited. While encouraging advances have been made in this area over the past couple of years, the majority of existing high-dimensional hypothesis testing methods still suffer from low statistical power or high computational cost. In this dissertation, we focus on developing hypothesis testing methods for high-dimensional linear and graphical models. In Chapter 2, we investigate a naive and simple two-step hypothesis testing procedure for linear models. We show that, under appropriate conditions, this simple procedure controls the type-I error rate and is closely connected to more complicated alternatives. We also show in numerical studies that it achieves performance similar to that of more computationally intensive procedures. In Chapter 3, we consider hypothesis testing for linear regression that incorporates external information about the relationships between variables, represented by a graph such as a gene regulatory network. We show in theory and in numerical studies that, by incorporating informative external information, our proposal is substantially more powerful than existing methods that ignore such information.
We also propose a more robust procedure for settings where the external information is potentially inaccurate or imprecise. This robust procedure adaptively chooses, based on the data, how much external information to incorporate. In Chapter 4, we shift our focus to Gaussian graphical models. We propose a novel procedure to test whether two Gaussian graphical models share the same edge set, while controlling the false positive rate. When the two networks differ, our proposal can identify the specific nodes and edges that show differential connectivity. In this chapter, we also demonstrate that, when the goal is to identify differentially connected nodes and edges, the results from our proposal are more interpretable than those from existing procedures based on covariance or precision matrices. We finish the dissertation with a discussion in Chapter 5, in which we present viable future research directions and discuss a possible extension of our proposals to vector autoregression models for time series.
- Biostatistics