Surrogate variable analysis
Modern high-throughput molecular biology experiments measure data for thousands of related features and seek to rank those features for association with some variables of experimental or clinical importance. The process of ranking features for association with primary variables is complicated by genetic, environmental, and technical factors that influence hundreds or thousands of features at a time. In highdimensional experiments these factors are often unknown, unmeasured, or incapable of being tractably modeled. Consistent patterns of variation across features due to unmeasured or unmodeled factors can confound the relationship between the primary variables and the measured features. In this thesis we provide a statistical framework for modeling large-scale noise dependence caused by unmeasured or unmodeled factors in high-throughput data. We argue that estimating the sources of noise dependence is more appropriate than estimating the pairwise covariance between all features when the number of features is large. A direct connection is made with the well-studied problem of multiple testing dependence, which typically focuses on the distribution of P-values from multiple testing procedures. We introduce the concept of surrogate variables, estimable linear combinations of the true unmeasured or unmodeled factors causing noise dependence, that can be included when modeling the relationship between the primary variables and the feature level data. We also propose algorithms for estimating surrogate variables based on principal component analysis of relevant subsets of features. Under certain conditions accounting for the estimated surrogate variables asymptotically corrects the ranking and error rate estimation in high-throughput data analysis. We also discuss pathological situations when surrogate variables can not be estimated. To illustrate the power of this approach, we apply our estimates of the surrogate variables to improve reproducibility in a large clinical gene expression study of trauma related outcomes.
- Biostatistics