Image analysis and signal extraction from cDNA microarrays
The emergence of microarray technology has invariably led to discussion among researchers about data reliability. Many factors affect the accuracy of gene expression data obtained from microarray experiments, ranging from noise inherent in the technology platform to variation arising from experimental design. High-level sources of variation are those that derive from the experimental design, while low-level sources of variation are the noise due to technological errors and biases in the lab. Although high-level sources of variation dominate microarray data most of the time, they can be overshadowed by data errors at the technological level. That is, although differences in the tissue sampled will likely cause most of the variation, images with artifact noise or with dust covering information will taint small, but potentially crucial, sections of the dataset.
Because variation due to experimental design is well-covered territory in statistical research, the focus of this dissertation is the low level of variation. The means of correcting for sources of technological variation are not obvious to genomics specialists or statisticians. The research presented here explains the causes of cDNA microarray data variability and presents methods to account for low-level variance at two points: (1) image analysis and (2) signal extraction. The image analysis takes a TIFF image; performs grid alignment, spot detection, background estimation, and flagging; and outputs the information to a text file. The image analysis routine outlined herein is automated, reproducible, and robust.
Signal extraction involves modeling spot pixel data to describe the overall spot intensity level and a measure of spot reliability, while incorporating both the red and green channels from the experiment. The spot quality measure is spot-specific and continuous, so that each data point in a set of experiments has an assigned data reliability weight.
This quality measure can then be used to downweight low-quality data in a regression-type analysis. In this way, spots that are tainted with artifact noise, and therefore have inaccurate expression levels, do not mar downstream analysis. A spot quality measure is also preferable to a flag: summarily removing flagged data creates missing-data problems, whereas a spot quality weight does not and may improve the efficiency of test statistics.
A wide variety of methods for estimating spot-level quality were investigated, including several ways to incorporate spatial structure between pixel pairs. Semi-parametric methods to describe the variance of spots were not estimable for this data structure in the absence of spot replication; if updates to microarray technology protocols include spot replication, semi-parametric measures can be revisited. Smoothers to describe correlation required a priori knowledge of the correlation size in order to adjust bandwidths, resulting in a circularity problem. Ultimately, a fully parametric and a fully non-parametric estimate of quality are introduced and shown to be feasible for a data reliability model.
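As a rough illustration of how a continuous spot quality weight might be used to downweight unreliable data in a regression-type analysis, the sketch below fits a two-group model by weighted least squares. It is not the dissertation's estimator: the function name `weighted_least_squares`, the log-ratio values, and the quality weights are all hypothetical, chosen only to show that a contaminated spot can be kept in the dataset while contributing little to the estimate.

```python
import numpy as np

def weighted_least_squares(X, y, w):
    """Minimize sum_i w_i * (y_i - X_i @ beta)**2.

    Implemented by scaling each row of X and y by sqrt(w_i) and
    solving the resulting ordinary least squares problem.
    """
    sw = np.sqrt(np.asarray(w, dtype=float))
    beta, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
    return beta

# Hypothetical log-ratios for one gene on six arrays (two conditions).
# The last spot is assumed to be contaminated by an image artifact.
condition = np.array([0, 0, 0, 1, 1, 1], dtype=float)
log_ratio = np.array([0.1, -0.1, 0.0, 1.0, 1.1, 4.0])

# Hypothetical continuous quality weights in (0, 1]; the tainted spot
# receives a small weight instead of being flagged and removed.
quality = np.array([1.0, 1.0, 1.0, 1.0, 1.0, 0.05])

# Design matrix: intercept plus condition indicator.
X = np.column_stack([np.ones_like(condition), condition])

beta_unweighted = weighted_least_squares(X, log_ratio, np.ones_like(quality))
beta_weighted = weighted_least_squares(X, log_ratio, quality)

print("condition effect, unweighted:", beta_unweighted[1])
print("condition effect, weighted:  ", beta_weighted[1])
```

In this toy setup the weighted estimate of the condition effect is pulled much less toward the artifact-tainted spot than the unweighted one, and no observation has to be discarded.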
- Biostatistics