Surrogate variable analysis

dc.contributor.author: Leek, Jeffrey Tullis (en_US)
dc.date.accessioned: 2009-10-07T00:01:29Z
dc.date.available: 2009-10-07T00:01:29Z
dc.date.issued: 2007 (en_US)
dc.description: Thesis (Ph. D.)--University of Washington, 2007. (en_US)
dc.description.abstract: Modern high-throughput molecular biology experiments measure data for thousands of related features and seek to rank those features for association with some variables of experimental or clinical importance. The process of ranking features for association with primary variables is complicated by genetic, environmental, and technical factors that influence hundreds or thousands of features at a time. In high-dimensional experiments these factors are often unknown, unmeasured, or incapable of being tractably modeled. Consistent patterns of variation across features due to unmeasured or unmodeled factors can confound the relationship between the primary variables and the measured features. In this thesis we provide a statistical framework for modeling large-scale noise dependence caused by unmeasured or unmodeled factors in high-throughput data. We argue that estimating the sources of noise dependence is more appropriate than estimating the pairwise covariance between all features when the number of features is large. A direct connection is made with the well-studied problem of multiple testing dependence, which typically focuses on the distribution of P-values from multiple testing procedures. We introduce the concept of surrogate variables, estimable linear combinations of the true unmeasured or unmodeled factors causing noise dependence, that can be included when modeling the relationship between the primary variables and the feature-level data. We also propose algorithms for estimating surrogate variables based on principal component analysis of relevant subsets of features. Under certain conditions, accounting for the estimated surrogate variables asymptotically corrects the ranking and error rate estimation in high-throughput data analysis. We also discuss pathological situations in which surrogate variables cannot be estimated. To illustrate the power of this approach, we apply our estimates of the surrogate variables to improve reproducibility in a large clinical gene expression study of trauma-related outcomes. (en_US)
dc.format.extent: vii, 122 p. (en_US)
dc.identifier.other: b59693265 (en_US)
dc.identifier.other: 236108525 (en_US)
dc.identifier.other: Thesis 57853 (en_US)
dc.identifier.uri: http://hdl.handle.net/1773/9586
dc.language.iso: en_US (en_US)
dc.rights: Copyright is held by the individual authors. (en_US)
dc.rights.uri: (en_US)
dc.subject.other: Theses--Biostatistics (en_US)
dc.title: Surrogate variable analysis (en_US)
dc.type: Thesis (en_US)
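
Illustrative note: the abstract describes estimating surrogate variables by principal component analysis, so that unmeasured factors can be included when modeling the primary variables. The sketch below is not the thesis's algorithm. It is a simplified toy version that assumes one binary primary variable, one hidden binary factor, and a PCA over the residuals of all features rather than a selected subset. All function names and simulation settings are hypothetical and chosen only for illustration.

```python
import math
import random


def center(v):
    """Subtract the mean from a list of numbers."""
    m = sum(v) / len(v)
    return [a - m for a in v]


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def correlation(a, b):
    a, b = center(a), center(b)
    return dot(a, b) / math.sqrt(dot(a, a) * dot(b, b))


def simulate(n_features=200, n_samples=40, seed=0):
    """Toy data (assumed setup): primary variable x, hidden factor z, noisy features."""
    rng = random.Random(seed)
    x = [j % 2 for j in range(n_samples)]  # primary variable: two groups
    z = [1 if j >= n_samples // 2 else 0 for j in range(n_samples)]  # unmeasured factor
    data = []
    for i in range(n_features):
        beta = 1.0 if i < 20 else 0.0  # only the first 20 features respond to x
        gamma = rng.gauss(0.0, 1.0)    # the hidden factor influences every feature
        data.append([beta * x[j] + gamma * z[j] + rng.gauss(0.0, 0.5)
                     for j in range(n_samples)])
    return data, x, z


def residualize(row, x):
    """Residuals from a least-squares fit of one feature on x, with intercept."""
    xc, yc = center(x), center(row)
    slope = dot(xc, yc) / dot(xc, xc)
    return [y - slope * a for y, a in zip(yc, xc)]


def leading_component(rows, n_iter=100, seed=1):
    """Normalized first principal component scores (one per sample), by power iteration."""
    rng = random.Random(seed)
    n = len(rows[0])
    v = [rng.gauss(0.0, 1.0) for _ in range(n)]  # random start vector
    for _ in range(n_iter):
        scores = [dot(row, v) for row in rows]                # R v
        v = [sum(s * row[j] for s, row in zip(scores, rows))  # R^T (R v)
             for j in range(n)]
        norm = math.sqrt(dot(v, v))
        v = [a / norm for a in v]
    return v


def estimate_surrogate(data, x):
    """Remove the primary-variable signal, then take the top residual component."""
    residuals = [residualize(row, x) for row in data]
    return leading_component(residuals)
```

Calling `data, x, z = simulate()` followed by `sv = estimate_surrogate(data, x)` gives one value per sample. In this toy setup `sv` tracks the hidden factor `z` closely (up to sign) while remaining uncorrelated with `x`, so it could be added as a covariate when each feature is tested against `x`.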

Files

Original bundle

Name: 3290558.pdf
Size: 7.17 MB
Format: Adobe Portable Document Format
