Addressing double dipping through selective inference and data thinning

Authors

Neufeld, Anna

Abstract

While classical statistical methods assume that we only ever test pre-specified hypotheses about pre-specified models, in reality scientists often explore their data before settling on models and hypotheses of interest. We refer to the practice of using the same data to generate and then test a hypothesis, or to fit and then evaluate a model, as double dipping. Problems arise when standard statistical procedures for testing hypotheses or evaluating models are applied in settings that involve double dipping. There are two possible ways to circumvent these challenges. The first is to develop specialized statistical procedures that account for double dipping. The second is to avoid double dipping by conducting hypothesis generation and hypothesis testing (or model fitting and model evaluation) on independent datasets. When we only have access to one dataset, we typically accomplish this via sample splitting: we split the observations in our dataset into two smaller datasets, so that one can be used for hypothesis generation or model fitting, and the other for hypothesis testing or model evaluation.

The first portion of this thesis proposes a selective inference framework for conducting inference after fitting a regression tree. Selective inference frameworks allow us to generate and test a null hypothesis using the same data by conditioning on the event that the data led us to select that null hypothesis.

The second portion of this thesis is motivated by problems that arise in the analysis of single-cell RNA sequencing data, in which scientists often first use their data to estimate latent variables, and then wish to either evaluate these latent variable models or use the estimated latent variables for downstream inference.
The pipelines used for latent variable estimation are very complex, so developing specialized procedures that account for double dipping in this setting would be very difficult. Furthermore, these are unsupervised problems for which sample splitting is not an option: estimating latent variable coordinates for half of the observations does not yield latent variable coordinates for the remaining observations, and thus no downstream evaluation or inference can be performed. To address these challenges, we propose Poisson count splitting, which splits a single observation in a dataset into two components that are independent under a Poisson assumption. We show that Poisson count splitting provides an alternative to sample splitting that allows us to avoid double dipping in unsupervised settings. As single-cell RNA sequencing data is often thought to be overdispersed relative to the Poisson distribution, we next propose negative binomial count splitting, which allows us to avoid double dipping under a more realistic and more general negative binomial assumption.

In the final portion of this thesis, we generalize the count splitting framework to a variety of distributions, and refer to the generalized framework as data thinning. Data thinning is a very general alternative to sample splitting that is useful far beyond the context of single-cell RNA sequencing data and, unlike sample splitting, can be applied in both supervised and unsupervised settings.
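The Poisson count splitting idea rests on the binomial thinning property of the Poisson distribution: if X ~ Poisson(λ) and X_train | X ~ Binomial(X, ε), then X_train ~ Poisson(ελ) and X_test = X − X_train ~ Poisson((1 − ε)λ), with X_train and X_test independent. The following is an illustrative sketch of that construction, not code from the thesis; the variable names, the split fraction ε = 0.5, and the simulated count matrix are our own assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate a cells-by-genes count matrix X with X[i, j] ~ Poisson(lam[i, j]).
lam = rng.gamma(shape=2.0, scale=3.0, size=(100, 10))
X = rng.poisson(lam)

# Poisson count splitting: draw X_train[i, j] ~ Binomial(X[i, j], eps)
# and set X_test = X - X_train. Under the Poisson assumption,
# X_train ~ Poisson(eps * lam), X_test ~ Poisson((1 - eps) * lam),
# and the two matrices are independent of each other.
eps = 0.5
X_train = rng.binomial(X, eps)
X_test = X - X_train

# X_train can now be used for model fitting (e.g., estimating latent
# variables), and the independent X_test for evaluation or inference,
# avoiding double dipping without splitting the observations themselves.
```

Note that, unlike sample splitting, both halves retain the full set of observations, which is what makes downstream evaluation of latent variable estimates possible in the unsupervised setting.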

Description

Thesis (Ph.D.)--University of Washington, 2023
