Cluster analysis of gene expression data
The invention of DNA microarrays allows us to study the simultaneous variation of genes on a genome-wide scale. A typical gene expression data set consists of thousands or even tens of thousands of genes and a few dozen experiments. Cluster analysis is the art of finding groups in a given data set such that objects in the same group are similar to each other while objects in different groups are dissimilar. There are many applications for clustering gene expression data.
Many different clustering algorithms and analytical techniques have been applied to gene expression data. Success of various analytical methodologies in specific instances has been reported, but extensive quantitative evaluations of clustering methodologies are rare. Since different analytical approaches may produce different clustering results, there is a great need to evaluate clustering techniques in order to choose an appropriate approach. An underlying theme of this dissertation is the systematic evaluation of clustering methodologies on gene expression data. Specifically, we proposed a data-driven methodology, called the figure of merit (FOM) methodology, to compare the quality of clusters produced by heuristic-based clustering algorithms. We also showed that the model-based clustering approach, which assumes a Gaussian mixture model, produces relatively high-quality clusters. The probabilistic framework of the model-based approach allows us to infer the correct number of clusters and to compare different models. Moreover, we investigated the effectiveness of a dimension reduction technique called principal component analysis as a pre-processing step before cluster analysis.
Our main contributions are methodologies for evaluating analytical techniques used in clustering gene expression data.
We employed an external validation approach, which evaluates clustering results by comparing them with external prior knowledge of the data, to assess the performance of internal validation approaches, which do not require any such knowledge. In particular, we showed that our FOM methodology and the model-based approach, both of which are internal, produce comparisons of clustering algorithms that are consistent with comparisons against external knowledge. Since external knowledge is seldom available for gene expression data, our work provides practical evaluation frameworks for assessing clustering results on gene expression data.
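As a rough illustration of the idea behind the FOM methodology, the sketch below assumes a leave-one-experiment-out formulation, which is not spelled out in this abstract: the genes are clustered using all experiments except one, and the FOM is the root mean squared deviation of the left-out experiment's expression levels from their cluster means. The function names (kmeans_labels, figure_of_merit, aggregate_fom) are hypothetical, and the plain k-means routine is only a stand-in for whichever heuristic-based clustering algorithm is being assessed.

```python
import numpy as np


def kmeans_labels(data, k, n_iter=50, seed=0):
    """Plain k-means; an assumed stand-in for any heuristic clustering algorithm."""
    data = np.asarray(data, dtype=float)
    rng = np.random.default_rng(seed)
    # Start from k distinct genes (rows) chosen at random.
    centers = data[rng.choice(len(data), size=k, replace=False)]
    labels = np.zeros(len(data), dtype=int)
    for _ in range(n_iter):
        # Squared Euclidean distance from every gene to every center.
        dists = ((data[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = data[labels == j]
            if len(members) > 0:  # keep the old center if a cluster empties
                centers[j] = members.mean(axis=0)
    return labels


def figure_of_merit(data, k, left_out, cluster_fn=kmeans_labels):
    """FOM for one left-out experiment (column) of a genes-by-experiments matrix."""
    data = np.asarray(data, dtype=float)
    # Cluster the genes without looking at the left-out experiment.
    kept = np.delete(data, left_out, axis=1)
    labels = np.asarray(cluster_fn(kept, k))
    held_out = data[:, left_out]
    sq_dev = 0.0
    for j in np.unique(labels):
        values = held_out[labels == j]
        # Deviation of each gene from its cluster mean in the left-out experiment.
        sq_dev += ((values - values.mean()) ** 2).sum()
    return float(np.sqrt(sq_dev / len(data)))


def aggregate_fom(data, k, cluster_fn=kmeans_labels):
    """Sum of the FOM over every choice of left-out experiment."""
    data = np.asarray(data, dtype=float)
    return sum(figure_of_merit(data, k, e, cluster_fn) for e in range(data.shape[1]))
```

Under these assumptions, a lower aggregate FOM at a given number of clusters suggests that the clusters are more predictive of the left-out experiments, so two algorithms can be compared by passing each as cluster_fn and comparing the resulting values over a range of k.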