Surrogate variable analysis

Leek, Jeffrey Tullis

dc.contributor.author	Leek, Jeffrey Tullis	en_US
dc.date.accessioned	2009-10-07T00:01:29Z
dc.date.available	2009-10-07T00:01:29Z
dc.date.issued	2007	en_US
dc.description	Thesis (Ph. D.)--University of Washington, 2007.	en_US
dc.description.abstract	Modern high-throughput molecular biology experiments measure data for thousands of related features and seek to rank those features for association with some variables of experimental or clinical importance. The process of ranking features for association with primary variables is complicated by genetic, environmental, and technical factors that influence hundreds or thousands of features at a time. In high-dimensional experiments these factors are often unknown, unmeasured, or incapable of being tractably modeled. Consistent patterns of variation across features due to unmeasured or unmodeled factors can confound the relationship between the primary variables and the measured features. In this thesis we provide a statistical framework for modeling large-scale noise dependence caused by unmeasured or unmodeled factors in high-throughput data. We argue that estimating the sources of noise dependence is more appropriate than estimating the pairwise covariance between all features when the number of features is large. A direct connection is made with the well-studied problem of multiple testing dependence, which typically focuses on the distribution of P-values from multiple testing procedures. We introduce the concept of surrogate variables, estimable linear combinations of the true unmeasured or unmodeled factors causing noise dependence, that can be included when modeling the relationship between the primary variables and the feature-level data. We also propose algorithms for estimating surrogate variables based on principal component analysis of relevant subsets of features. Under certain conditions, accounting for the estimated surrogate variables asymptotically corrects the ranking and error rate estimation in high-throughput data analysis. We also discuss pathological situations in which surrogate variables cannot be estimated.
To illustrate the power of this approach, we apply our estimates of the surrogate variables to improve reproducibility in a large clinical gene expression study of trauma-related outcomes.	en_US
dc.format.extent	vii, 122 p.	en_US
dc.identifier.other	b59693265	en_US
dc.identifier.other	236108525	en_US
dc.identifier.other	Thesis 57853	en_US
dc.identifier.uri	http://hdl.handle.net/1773/9586
dc.language.iso	en_US	en_US
dc.rights	Copyright is held by the individual authors.	en_US
dc.rights.uri		en_US
dc.subject.other	Theses--Biostatistics	en_US
dc.title	Surrogate variable analysis	en_US
dc.type	Thesis	en_US

Files

Original bundle

Name: 3290558.pdf
Size: 7.17 MB
Format: Adobe Portable Document Format

Collections

Biostatistics