Bicluster-Based Identification of Gene Sets Through Multivariate Meta-Analysis (MVMA)
Omics technologies are among the most exciting developments in biology and medicine in recent decades. They offer a new way of investigating a sample or a patient by taking comprehensive molecular-level snapshots. These snapshots, in the form of massive amounts of data, provide important hints about the pathophysiological state of the target. Despite the promise of omics technologies, their usefulness hinges upon proper translation of the data into knowledge. This dissertation focuses on mining public gene expression data to discover gene sets that may be parts of biological pathways. It tries to answer two overall questions: (1) which data mining method is best suited for finding gene sets, and (2) how can multiple datasets best be used to increase statistical power? Biclustering has proven highly effective for identifying gene sets. Unlike traditional clustering methods, biclustering identifies a list of genes that are up- or down-regulated under a subset of the conditions, rather than across the whole spectrum of conditions. A large number of biclustering algorithms have been applied to the analysis of gene expression data. One of these algorithms, Condition-dependent Correlation Subgroups (CCS), is chosen for the current study. Identifying individual biclusters using CCS is the task of Aim 1. Most public expression datasets have relatively small sample sizes. Inference on such datasets may be error-prone, which motivates the use of multiple datasets to increase statistical power. This study makes use of multiple related datasets by adapting the approach of meta-analysis. More specifically, a group of biclusters, each coming from a separate dataset, is identified, and meta-analysis is then applied to these biclusters. The biclusters are thus analogous to the individual studies in a traditional meta-analysis. The goal is to identify a gene set by combining the evidence in the individual biclusters.
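To make the contrast between biclustering and traditional clustering concrete, the following minimal sketch compares the correlation of two genes across all conditions with their correlation on a subset of conditions. It is not the CCS algorithm; the expression values and the helper names `pearson` and `conditional_correlation` are hypothetical and chosen only for illustration.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def conditional_correlation(x, y, conditions):
    """Correlation restricted to the given condition (column) indices."""
    return pearson([x[i] for i in conditions], [y[i] for i in conditions])


# Toy expression profiles over 8 conditions (made-up numbers).
# The two genes move together in conditions 0-3 and oppositely in 4-7.
gene_a = [1.0, 2.0, 3.0, 4.0, 1.0, 4.0, 2.0, 3.0]
gene_b = [2.0, 4.0, 6.0, 8.0, 8.0, 2.0, 6.0, 4.0]

r_all = pearson(gene_a, gene_b)                               # all conditions
r_subset = conditional_correlation(gene_a, gene_b, range(4))  # conditions 0-3

print(f"correlation over all conditions: {r_all:.2f}")
print(f"correlation over conditions 0-3: {r_subset:.2f}")
```

In this toy example the co-regulation is invisible when all conditions are pooled but clear on the subset, which is the kind of condition-dependent pattern a biclustering method is designed to pick up.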
Since each gene in this group of biclusters is modeled as an endpoint (equivalent to an outcome in the traditional meta-analysis context), and the correlations among the endpoints are taken into consideration, the approach of multivariate meta-analysis (MVMA) is taken. Using MVMA to combine biclusters from separate datasets is the focus of Aim 2. Although biclustering significantly reduces the dimension, analyzing the resulting stack of biclusters still faces the difficulty of high dimensionality (p) combined with a small number of available datasets (n), the well-known p >> n problem. The traditional MVMA methods, whether within the Bayesian or the frequentist framework, are not effective when p exceeds 50. Since a typical bicluster stack has a dimension in the range of 70 to 150, the traditional methods are impractical in the current context. A previous study by Jackson and Riley proposed an interesting two-step procedure for MVMA to tackle the issue of data scarcity. Step 1 estimates the between-study covariance matrix, and step 2 makes inference about the overall effect sizes. In step 2, a multivariate t distribution rather than a normal distribution is used in order to take the uncertainty of the between-study variance estimate into account. The Jackson and Riley method is implemented and tested in the current study. Unfortunately, it is found to be still slow for moderate or high dimensions, mainly because of the method of moments (MM) used in step 1. To overcome this constraint, an alternative step 1 method is proposed, which uses a weighted sample covariance matrix, subject to matrix regularization, to approximate the between-study variance/covariance. A series of simulation studies shows that the improved two-step procedure performs favorably compared to the traditional MVMA methods as well as the original Jackson and Riley routine.
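The general shape of such a two-step procedure can be sketched as follows. This is an illustration under stated assumptions, not the implementation used in the study: the weighting scheme (inverse of the total within-study variance), the regularization (shrinkage of the covariance toward its diagonal with a fixed intensity `shrink`), and the function names are all hypothetical choices. Step 2 uses the standard generalized least squares pooling for a random-effects model; the returned ratios would be referred to a t rather than a normal reference distribution, and the exact degrees of freedom and scaling used by Jackson and Riley are not reproduced here.

```python
import numpy as np


def regularized_between_cov(effects, within_covs, shrink=0.5):
    """Step 1 (sketch): weighted sample covariance, shrunk toward its diagonal.

    effects:     (n, p) array, one row of per-gene effect sizes per dataset
    within_covs: (n, p, p) array of within-study covariance matrices
    shrink:      hypothetical shrinkage intensity in [0, 1]
    """
    effects = np.asarray(effects, dtype=float)
    within_covs = np.asarray(within_covs, dtype=float)
    # Assumed weighting: datasets with less within-study variance count more.
    w = 1.0 / np.trace(within_covs, axis1=1, axis2=2)
    w = w / w.sum()
    resid = effects - w @ effects
    # Weighted sample covariance with the reliability-weights correction.
    cov = (resid * w[:, None]).T @ resid / (1.0 - np.sum(w ** 2))
    # Shrinking toward the diagonal keeps the estimate well conditioned
    # even when p is much larger than n.
    return (1.0 - shrink) * cov + shrink * np.diag(np.diag(cov))


def pooled_effects(effects, within_covs, between_cov):
    """Step 2 (sketch): GLS pooling of effect sizes given the step 1 estimate."""
    effects = np.asarray(effects, dtype=float)
    within_covs = np.asarray(within_covs, dtype=float)
    p = effects.shape[1]
    precision_sum = np.zeros((p, p))
    weighted_sum = np.zeros(p)
    for y_i, s_i in zip(effects, within_covs):
        w_i = np.linalg.inv(s_i + between_cov)  # inverse total covariance
        precision_sum += w_i
        weighted_sum += w_i @ y_i
    cov_mu = np.linalg.inv(precision_sum)
    mu = cov_mu @ weighted_sum
    se = np.sqrt(np.diag(cov_mu))
    return mu, se, mu / se  # estimate, standard error, t-type statistic


# Toy usage with simulated data (all values are made up for illustration).
rng = np.random.default_rng(0)
n, p = 6, 80  # few datasets, many genes
true_mu = rng.normal(1.0, 0.3, size=p)
effects = true_mu + rng.normal(0.0, 0.5, size=(n, p))
within_covs = np.stack([np.eye(p) * 0.1 for _ in range(n)])

sigma_hat = regularized_between_cov(effects, within_covs)
mu_hat, se, t_stat = pooled_effects(effects, within_covs, sigma_hat)
```

Note that the sample covariance of observed effects also absorbs within-study noise, so this step 1 is only a rough approximation of the between-study covariance; its appeal in a p >> n setting is that it needs no iterative estimation.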
Given these results, the new two-step procedure is applied to the analysis of real bicluster stacks, which yields a series of candidate gene sets. The candidate gene sets are then analyzed in Aim 3 by enrichment-based analyses using public pathway knowledge bases. The specific methods used include Over Representation Analysis (ORA), Gene Set Enrichment Analysis (GSEA), and Network Topology-based Analysis (NTA). A key finding is that high-certainty effect size estimates derived from MVMA are often associated with significant enrichment results from the pathway analysis, especially when the bicluster stack is large enough. In other words, effect size estimates are predictive of the biological relevance of the gene sets, which is perhaps the most significant result of the current study.
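As an illustration of the simplest of these enrichment methods, ORA is commonly formulated as a one-sided hypergeometric test on the overlap between a candidate gene set and a pathway. The sketch below shows that formulation only; the function names and the gene identifiers are hypothetical, and the tools and settings used in the study may differ.

```python
from math import comb


def ora_p_value(n_universe, n_pathway, n_candidate, n_overlap):
    """Upper-tail hypergeometric probability P(X >= n_overlap).

    n_universe:  number of genes in the background set
    n_pathway:   number of background genes annotated to the pathway
    n_candidate: size of the candidate gene set
    n_overlap:   number of candidate genes that fall in the pathway
    """
    max_overlap = min(n_pathway, n_candidate)
    total = comb(n_universe, n_candidate)
    tail = sum(
        comb(n_pathway, k) * comb(n_universe - n_pathway, n_candidate - k)
        for k in range(n_overlap, max_overlap + 1)
    )
    return tail / total


def ora_for_sets(candidate, pathway, universe):
    """Convenience wrapper that works directly on gene identifier sets."""
    candidate = set(candidate) & set(universe)
    pathway = set(pathway) & set(universe)
    overlap = candidate & pathway
    return ora_p_value(len(universe), len(pathway), len(candidate), len(overlap))


# Toy usage with made-up gene identifiers.
universe = {f"gene{i}" for i in range(200)}
pathway = {f"gene{i}" for i in range(20)}            # 20 annotated genes
candidate = {f"gene{i}" for i in range(10, 30)}      # 10 of 20 in the pathway
p_value = ora_for_sets(candidate, pathway, universe)
```

In practice the resulting p-values would be computed for many pathways at once and adjusted for multiple testing before being interpreted.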