Statistical methods for analyzing genomic data with consideration of spatial structures
High-dimensional genetic data, such as DNA copy number and single nucleotide polymorphism (SNP) data, give researchers unprecedented capabilities for studying the genetic basis of diseases. In this dissertation, we develop statistical methods for analyzing two types of high-dimensional genomic data with consideration of spatial structures. In part one, we consider detecting DNA copy number changes using a multiscale wavelet transformation. Genomic instability, such as copy number losses and gains, occurs in many genetic diseases. Studying such instability can help us understand the mechanisms underlying disease occurrence and progression. Array-based Comparative Genomic Hybridization (array-CGH) is a powerful technology for measuring copy numbers at thousands of loci simultaneously. We propose a wavelet-based non-parametric approach for detecting copy number changes. Our novel test statistic, the maximum of 2-scale wavelet products across scales, combines information across scales to improve power. We explore two non-parametric approaches for estimating the null distribution: permuting wavelet coefficients at the finest scale, and permuting residuals after lowess smoothing. Adjusted p-values are estimated using the step-down maxT permutation algorithm, which controls the family-wise error rate. To avoid false positives caused by autocorrelation between adjacent wavelet coefficients, we propose testing only at locations where local maxima occur. Finally, we illustrate our method using two real data sets and perform a simulation study to investigate its finite-sample performance compared with two existing methods: a sequential testing method and a model selection method.
In part two, we consider genetic association studies with tightly linked SNP markers using family data.
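As a rough illustration of the part-one procedure, the sketch below computes a maximum of 2-scale products and step-down maxT adjusted p-values. It is a minimal sketch under stated assumptions, not the dissertation's implementation: the Haar-type detail coefficients, the scale set `(1, 2, 4)`, and the function names `haar_detail`, `two_scale_max_product`, and `stepdown_maxT` are all hypothetical choices. Generating the permutation null and restricting tests to local maxima are left out.

```python
import numpy as np

def haar_detail(x, h):
    """Haar-type detail at window h (assumed form): mean of the h points to the
    right of each position minus the mean of the h points to its left."""
    x = np.asarray(x, dtype=float)
    n = x.size
    c = np.concatenate(([0.0], np.cumsum(x)))  # prefix sums
    d = np.zeros(n)                            # edges stay at zero
    i = np.arange(h, n - h + 1)
    d[i] = ((c[i + h] - c[i]) - (c[i] - c[i - h])) / h
    return d

def two_scale_max_product(x, scales=(1, 2, 4)):
    """Maximum over adjacent scale pairs of the product of detail coefficients."""
    details = [haar_detail(x, h) for h in scales]
    products = [a * b for a, b in zip(details[:-1], details[1:])]
    return np.max(products, axis=0)

def stepdown_maxT(obs, perm):
    """Step-down maxT adjusted p-values.

    obs  : observed statistics, shape (m,)
    perm : statistics recomputed on B permuted data sets, shape (B, m)
    """
    obs = np.abs(np.asarray(obs, dtype=float))
    perm = np.abs(np.asarray(perm, dtype=float))
    order = np.argsort(-obs)                   # most significant first
    ps = perm[:, order]
    # successive maxima: u[b, j] = max(ps[b, j:])
    u = np.maximum.accumulate(ps[:, ::-1], axis=1)[:, ::-1]
    p_sorted = (u >= obs[order]).mean(axis=0)
    p_sorted = np.maximum.accumulate(p_sorted)  # enforce monotonicity
    adj = np.empty_like(p_sorted)
    adj[order] = p_sorted
    return adj
```

On a noiseless step signal the product statistic peaks at the change point, because the detail coefficients at all scales agree there. The successive-maxima step is what makes the adjustment step-down rather than single-step.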
The transmission/disequilibrium test (TDT) is a popular approach for assessing linkage disequilibrium between a marker locus and a candidate disease locus. To account for local dependency in the presence of phase ambiguity, the TDT has been extended to multiple tightly linked markers by constructing haplotypes statistically. As an alternative, we propose a locally weighted TDT approach that weights the contributions of multiple SNPs within a prespecified neighborhood according to their association with the locus of interest. We illustrate our method using GAW14 data.
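For orientation, the classic single-marker TDT that the locally weighted approach builds on can be sketched as follows. This shows only the standard McNemar-type statistic. The function name `tdt` and its arguments are illustrative assumptions, and the neighborhood weighting scheme proposed in the dissertation is not reproduced here.

```python
import math

def tdt(b, c):
    """Classic single-marker TDT.

    b : times heterozygous parents transmit the allele of interest
        to an affected offspring
    c : times they transmit the other allele
    Returns the chi-square statistic (1 df) and its asymptotic p-value.
    """
    if b + c == 0:
        raise ValueError("no informative transmissions")
    chi2 = (b - c) ** 2 / (b + c)
    # Upper tail of a chi-square with 1 df: P(X > x) = erfc(sqrt(x / 2))
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value
```

For example, `tdt(60, 40)` gives a statistic of 4.0. Only transmissions from heterozygous parents are informative, which is why homozygous parents do not enter the counts.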
- Biostatistics