Statistics
Permanent URI for this collection: https://digital.lib.washington.edu/handle/1773/4971
Recent Submissions

Spatio-Temporal Statistical Inference for Human Mobility Using GPS Data (2026-04-20)
Wu, Haoyang; Dobra, Adrian; Chen, Yen-Chi

Understanding where individuals spend their time, and when, is a central question in the study of human mobility. The increasing availability of high-resolution GPS data provides unprecedented opportunities to address this question, but it also poses substantial statistical challenges arising from measurement error, heterogeneous sampling frequencies, and complex temporal structure. This dissertation develops a unified spatio-temporal statistical framework for modeling and estimating interpretable summaries of long-term human mobility from GPS data. At the core of the framework is a stochastic representation of daily mobility patterns, in which GPS observations are viewed as noisy measurements of latent spatio-temporal movement processes. Within this data-generating view, key inferential targets are formulated as time-allocation functionals that quantify the proportion of time individuals spend in different spatial regions. Estimation procedures are constructed by combining time-weighted representations of observed locations with aggregation across days, yielding activity-related summaries with well-defined statistical properties. This approach shifts attention from trajectory reconstruction to the principled estimation of time allocation over space. The central inferential construct emerging from this modeling strategy is the activity space, defined as a time-weighted characterization of routine spatial behavior. Rather than treating movement paths as primary objects of analysis, activity spaces are derived as functionals of latent daily processes, allowing for coherent inference under realistic measurement conditions. The framework accommodates multiple spatial supports, including continuous domains, geometrically constrained environments, and aggregated regional contexts. The dissertation consists of three complementary main chapters. Chapter 2 establishes the foundational modeling and estimation framework for daily mobility processes and derives statistical properties for time-proportion estimators. Chapter 3 extends this framework to polygon-network representations, incorporating geometric constraints into both modeling and inference. Chapter 4 integrates the resulting mobility summaries into applied analysis, demonstrating how time-weighted activity measures can be combined with external spatial information to study contextual exposure in public health settings. Together, these contributions provide a coherent model-based approach to spatio-temporal inference on human mobility that links data generation, estimation, spatial representation, and scientific application within a unified statistical framework.
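
To make the time-allocation idea concrete, here is a minimal Python sketch (an illustration, not the dissertation's estimator): each GPS fix is weighted by the time interval it plausibly represents, weights are aggregated by region within a day, and daily summaries are averaged into a long-run activity-space profile. The function names and the midpoint dwell weighting are assumptions made for illustration.

    from collections import defaultdict

    def day_time_allocation(fixes):
        """fixes: list of (timestamp_seconds, region_id), sorted by time.
        Returns {region_id: estimated fraction of observed time}."""
        if len(fixes) < 2:
            return {fixes[0][1]: 1.0} if fixes else {}
        weights = defaultdict(float)
        for i, (t, region) in enumerate(fixes):
            left = fixes[i - 1][0] if i > 0 else t
            right = fixes[i + 1][0] if i + 1 < len(fixes) else t
            weights[region] += (right - left) / 2.0  # midpoint dwell weight
        total = sum(weights.values())
        return {r: w / total for r, w in weights.items()}

    def average_allocation(days):
        """Average daily allocations (days: nonempty list of fix lists)
        into a long-run time-allocation summary."""
        totals = defaultdict(float)
        for fixes in days:
            for r, p in day_time_allocation(fixes).items():
                totals[r] += p / len(days)
        return dict(totals)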

Statistical Methods toward Trustworthy AI: From Diagnosis to Controllability and Societal Impact (2026-02-05)
Fisher, Jillian; Richardson, Thomas; Choi, Yejin

This dissertation examines three core dimensions of Trustworthy AI: diagnosis, control, and societal impact, using statistical and machine learning methods. While the rapid advancement of large-scale AI has led to widespread adoption in everyday life, research into its reliability, safety, and social implications remains nascent. To address these gaps, this dissertation develops both theoretical foundations and practical methodologies for building more reliable AI systems. Part I (Diagnosis) provides finite-sample statistical and computational guarantees for influence diagnostics. Specifically, Chapter 2 introduces finite-sample statistical bounds, as well as computational complexity bounds, for influence functions and approximate maximum influence perturbations using efficient inverse-Hessian-vector product implementations. These bounds can then be used to better characterize and detect sources of bias in models ranging from generalized linear models to attention-based architectures. Part II (Control) introduces novel methods for controllable generation across different model scales and modalities. Chapter 3 develops an unsupervised, inference-time approach to a controllable generation task, authorship obfuscation, in small language models. Chapter 4 proposes an adaptive, interpretable framework for medium-sized models, supported by a newly created large-scale, multi-style dataset. Chapter 5 extends controllability techniques to vision-language models, presenting a lightweight self-improvement framework that enables iterative critique and revision without external supervision. Part III (Societal Impact) investigates the downstream consequences of AI bias on users. Chapter 6 presents interactive experiments showing that partisan bias in large language models can meaningfully influence political opinions and decision-making. Chapter 7 argues that political neutrality in AI is impossible, formalizes approximations to neutrality, introduces techniques for achieving these approximations at multiple conceptual levels, and evaluates contemporary models under this framework. Together, these contributions advance the study of Trustworthy AI by unifying statistical rigor with practical experimentation. The work not only strengthens our ability to diagnose and control AI behavior but also exposes its societal risks and outlines concrete pathways toward mitigating them.
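
For context, the classical first-order influence approximation that such diagnostics build on (written here for a generic M-estimator; this is the standard formula, not a result specific to the dissertation): removing an observation $z$ from $\hat\theta = \arg\min_\theta \frac{1}{n}\sum_i \ell(z_i, \theta)$ shifts the estimate by approximately
\[
\hat\theta_{-z} - \hat\theta \;\approx\; \frac{1}{n}\, H_{\hat\theta}^{-1} \nabla_\theta \ell(z, \hat\theta),
\qquad
H_{\hat\theta} = \frac{1}{n}\sum_{i=1}^{n} \nabla^2_\theta \ell(z_i, \hat\theta).
\]
Inverse-Hessian-vector products compute $H_{\hat\theta}^{-1} v$ without ever forming the inverse, which is what makes influence diagnostics feasible for large models.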

Integrative Analysis of Non-Euclidean Data (2026-02-05)
Buenfil, James; Lila, Eardi

In large-scale imaging studies, a primary goal is to understand the relationship between distinct data views of study participants. For example, one data view could consist of patients' brain MRI scans, while a second view includes their lifestyle, demographic, or psychometric measures. A significant challenge is that these views are often subject to complex non-Euclidean constraints. Two settings arise: in some cases, the geometric constraints are known a priori, as with brain functional connectivity data, which lie on the manifold of positive definite matrices; in other cases, no explicit manifold representation is available, and the underlying geometry must be learned from the data. Additionally, the relationships between these views are often weak, further complicating the analysis. Despite extensive work on data integration, most approaches fail to accommodate non-Euclidean constraints while providing interpretable embeddings. In this dissertation, we propose novel frameworks to identify interpretable relationships between heterogeneous data views while accounting for their distinct underlying structures. Specifically, in Chapter 2, we develop a canonical correlation analysis model to integrate time-varying, manifold-valued data with high-dimensional data. Our approach leverages tools from Riemannian geometry to handle non-Euclidean constraints and introduces a group-sparsity penalty to select important variables. The proposed method shows improved empirical performance over existing approaches and is applied to dynamic functional connectivity data from the Human Connectome Project. Furthermore, we establish asymptotic consistency through both in-sample and out-of-sample error bounds for the estimated canonical directions and scores. In Chapter 3, we extend the proposed model to learn interpretable embeddings automatically from the data, thereby estimating its underlying geometry. To achieve this, we formulate a Partially Linear interpretable Canonical Correlation Analysis (PLiCCA) model and prove the existence of population solutions. We establish formal connections between PLiCCA and conditional latent-variable models, specifically conditional variational autoencoders and conditional normalizing flows. We show that these latent-variable models can be interpreted as relaxations of the PLiCCA problem in which difficult global constraints are replaced by tractable local ones. This perspective enables PLiCCA to be solved efficiently via 'proxy' problems derived from contemporary conditional generative models, providing an alternative to the models proposed in the first project when the underlying structure of the data is unknown.
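
For orientation, a Euclidean caricature of the kind of group-sparse CCA criterion Chapter 2 generalizes (the exact formulation here is an assumption for illustration; the chapter's model replaces these inner products with Riemannian-geometric constructions):
\[
\max_{u,\,v}\;\; u^\top \widehat\Sigma_{XY}\, v \;-\; \lambda \sum_{g \in \mathcal{G}} \|u_g\|_2
\quad \text{s.t.} \quad u^\top \widehat\Sigma_{XX}\, u \le 1, \;\; v^\top \widehat\Sigma_{YY}\, v \le 1,
\]
where the group penalty zeroes out entire blocks $u_g$ of coefficients, yielding variable selection and hence interpretable canonical directions.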

Causal Effect Identification via Equivalence Classes of Acyclic Graphs and Data-Driven Adjustment (2025-10-02)
LaPlante, Sara; Perković, Emilija

Identifying a causal effect involves finding a function of observational densities that serves as an equivalent form of the interventional density of interest -- denoted by $f(y | do(x))$ in the unconditional setting and $f(y | do(x), z)$ in the conditional setting. This equivalence allows researchers to rely on observational data alone to estimate a causal effect. We consider the problem of identifying causal effects across three chapters. In our first chapter, we focus on identifying conditional effects through covariate adjustment in a setting where the causal graph is known up to one of two types of graphs: a maximally oriented partially directed acyclic graph (MPDAG) or a partial ancestral graph (PAG). We provide a necessary and sufficient graphical criterion for finding conditional adjustment sets when conditioning on variables unaffected by treatment, and we provide explicit sets from the graph that satisfy this criterion. In our second chapter, we continue exploring covariate adjustment but turn to the unconditional setting, where there is no prior knowledge of the underlying causal graph. We present two routes for finding adjustment sets that instead rely on in/dependencies in the data directly. One route applies a concept known as c-equivalence to extend the work of Entner et al. (2013) under a single treatment; the other provides sufficient criteria for finding adjustment sets under multiple treatments. In our third chapter, we return to conditional identification where the causal graph is known up to an MPDAG, but rather than focusing on covariate adjustment, we consider identification more generally. We develop a conditional identification formula, based on graphical criteria, that extends beyond settings where conditional adjustment sets exist, and we pair this with a necessary and sufficient criterion for when this identification is possible. Further, we extend the well-known do calculus to the MPDAG setting and build a conditional identification algorithm based on this calculus that is complete for identifying these conditional effects.
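
For readers new to this notation, the covariate adjustment identities that "identification via adjustment" refers to are, in their standard form,
\[
f(y | do(x)) \;=\; \int f(y | x, \mathbf{b})\, f(\mathbf{b})\, d\mathbf{b},
\qquad
f(y | do(x), z) \;=\; \int f(y | x, z, \mathbf{b})\, f(\mathbf{b} | z)\, d\mathbf{b},
\]
where $\mathbf{B}$ is a (conditional) adjustment set. The dissertation's graphical criteria characterize when such sets exist in an MPDAG or PAG and how to read them off the graph.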

Statistical Learning from Shifting, Indirect, or Unseen Data: Efficient Algorithms and Theoretical Guarantees (2025-10-02)
Mehta, Ronak; Harchaoui, Zaid

A fascinating phenomenon underlying statistical machine learning and artificial intelligence is "out-of-distribution" (OOD) generalization: data can (and in some settings, must) be used to draw inferences about probability distributions other than the one from which they were sampled. Understanding this phenomenon promises statistical analyses with a degree of universality, such as clinical trials whose conclusions reflect many subpopulations, or pre-defined image/text encodings that can be used to solve many classification tasks simultaneously. This dissertation tackles the theoretical and algorithmic challenges of designing methods that exhibit these modern notions of generalization. Chapter 2 studies distributionally robust optimization (DRO), a learning framework that promotes OOD generalization by training models to optimize the worst-case expected loss achievable within a collection of possible training distributions. These maximum-type objectives complicate the design of stochastic learning algorithms, because unbiased estimates of the gradient are not easily computed. We design an estimator equipped with a progressive bias (and variance) reduction scheme, for which the resulting algorithm is shown to have a linear convergence guarantee. Although our optimization results apply to DRO problems more generally, we focus attention on a subclass of objectives called spectral risk measures, which have appealing statistical and computational properties previously unexplored in machine learning. We provide theoretical and practical guidance on selecting the various problem parameters, such as the collection of distributions over which to maximize, and we present, among other results, extensions to group DRO, a popular variant of the framework amenable to training neural network models. Chapter 3 takes insights from the DRO application and pursues stochastic algorithms for a more general class of optimization problems, dubbed semilinear min-max problems. These objectives interpolate between the well-understood class of bilinear min-max problems and the relatively less understood nonbilinear ones, and include problem classes such as convex minimization with functional constraints as special cases. We present the first complexity guarantees for this problem class, using a randomized algorithm with components inspired by the simulation literature, such as adaptive sampling of new data and adaptive averaging of historical data. We prove convergence guarantees in both convex and strongly convex settings with a fine-grained dependence on individual problem constants; the results yield complexity improvements even in specific cases, such as bilinearly coupled problems. We also provide a lower complexity bound on the performance of deterministic algorithms applied to the semilinear problem class. Chapter 4 shifts focus from the implementation of large-scale learning algorithms to their output. We investigate predictive models that learn via a pre-training procedure with unlabeled data and can then make predictions for downstream classification tasks without having seen any directly labeled training data from those tasks. This capability, known as zero-shot prediction, is made possible by three ingredients: 1) massive, carefully curated pre-training datasets; 2) "self-supervised" labels that allow models to learn universal features of structured data (e.g., images/text); and 3) the translation of downstream data into the format seen during pre-training using a technique called prompting. We analyze all three ingredients theoretically by establishing both the sample complexity and the limits of prompting in terms of simple distributional conditions. Inspired by this theory, we explore variants on the pre-training objective and prompting strategies that show practical benefits such as improved zero-shot classification accuracy.
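
For concreteness, a spectral risk measure, the objective subclass highlighted in Chapter 2, replaces the average training loss by a weighted average that emphasizes the worst cases:
\[
\mathcal{R}_\sigma(\theta) \;=\; \sum_{i=1}^{n} \sigma_i\, \ell_{(i)}(\theta),
\qquad 0 \le \sigma_1 \le \cdots \le \sigma_n, \quad \sum_{i=1}^{n} \sigma_i = 1,
\]
where $\ell_{(1)}(\theta) \le \cdots \le \ell_{(n)}(\theta)$ are the sorted losses. Because the weights are nondecreasing, this equals a worst case over reweightings of the empirical distribution, which is what places it inside the DRO framework; the conditional value-at-risk (uniform weight on the largest losses) is the best-known special case.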

Bayesian Nonparametric Methods for Complex Datasets (2025-10-02)
Jiang, Ziyu; Wakefield, Jon; Rodriguez, Abel

Modern data tend to present complex structures that challenge classical modeling assumptions and frameworks, including heterogeneity and spatial and/or temporal dependency. Bayesian nonparametric (BNP) models are powerful tools for addressing these challenges: they enable flexible modeling structures that adapt to the complexity of the data and provide uncertainty estimation. This dissertation proposes several BNP methods that are applicable to a wide range of statistical learning problems in regression, clustering, and density estimation, with applications in fields including global health and financial econometrics. In Chapter 2, we propose a novel model that integrates the Bayesian additive regression tree (BART) prior into the Gaussian process spatial model, aimed at spatial prediction problems where the covariate effects may be nonlinear. In Chapter 3, we study and compare the computational performance of multivariate Hawkes process (MHP) models, temporal processes commonly used to model mutually exciting behaviors in temporal event sequences. In Chapter 4, we apply the dependent Dirichlet process (DDP) to model the temporal dynamics in MHP models; our model allows for flexible and adaptive modeling of excitation functions while borrowing information across dimensions. Future research directions related to the topics of this dissertation are outlined in Chapter 5.

Topics in Estimation and Inference with Multivariate Missing Data (2025-10-02)
Suen, Daniel; Chen, Yen-Chi

This dissertation develops statistical methodologies for complex missing data problems, with a focus on multivariate and partially observed structures. Chapter 2 introduces a unified framework for handling multiple missing covariates and partially observed responses using inverse probability weighting, regression adjustment, and a multiply robust procedure. Applications include the Cox model for survival analysis, missing responses, and binary treatment in causal inference, along with supporting identification and asymptotic theory. Chapter 3 focuses on modeling multivariate bounded discrete outcomes, such as those from neuropsychological tests in dementia studies. We propose a flexible modeling strategy based on mixtures of experts and latent class models, extended to handle outcomes that are missing at random via a nested EM algorithm. The joint model also allows for imputation and clustering. Chapter 4 addresses nonmonotone missing data under missing not at random (MNAR) mechanisms, extending the work in Chapter 3. We organize the missing patterns into tree graphs, directed acyclic graphs on the patterns in which each tree represents an MNAR mechanism. Combining this with the idea of a conjugate odds property, we are able to preserve distributional structure across missing patterns and construct relatively straightforward models for the full data distribution. Throughout the dissertation, we highlight practical relevance using an Alzheimer's disease data set.
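
As a reference point for Chapter 2's constructions, the textbook estimators for a mean $\mu = E[Y]$ when $Y$ is missing at random given covariates $X$ (these are the classical single-outcome prototypes, not the dissertation's multivariate procedures) are
\[
\hat\mu_{\mathrm{IPW}} = \frac{1}{n}\sum_{i=1}^{n} \frac{R_i\, Y_i}{\hat\pi(X_i)},
\qquad
\hat\mu_{\mathrm{DR}} = \frac{1}{n}\sum_{i=1}^{n}\Bigl[\hat m(X_i) + \frac{R_i\,\bigl(Y_i - \hat m(X_i)\bigr)}{\hat\pi(X_i)}\Bigr],
\]
where $R_i$ indicates that $Y_i$ is observed, $\hat\pi$ estimates $P(R = 1 | X)$, and $\hat m$ estimates $E[Y | X, R = 1]$. The doubly robust form is consistent if either working model is correct; multiply robust procedures extend this protection to several candidate models at once.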

Leveraging network information to improve population size estimation in social and environmental applications (2025-08-01)
Kunke, Jessica; McCormick, Tyler H

Social and ecological processes contain network structures, such as interpersonal relationships and the flow of water through a river system. This dissertation develops methods for using such network information to improve population size estimates in both social (Chapter 2) and ecological (Chapters 3-4) contexts. The first project considers the problem of estimating the size of a human subpopulation that is hard to reach through traditional survey methods, and contributes a framework for studying the performance of network scale-up method (NSUM) estimators when certain modeling assumptions are violated. A cost-effective approach to estimating the size or prevalence of such a subpopulation, the NSUM makes several strong assumptions, including the random mixing assumption that any two people are equally likely to know each other. The basic NSUM involves two steps: estimating respondents' degrees, that is, the number of people they know, and then using these estimated degrees, along with the number of people respondents report knowing in the hard-to-reach subpopulation of interest, to estimate the prevalence of that subpopulation. Each of these two steps involves taking either an average of ratios or a ratio of averages, and using the ratio of averages at each step has been the most common approach. However, we develop theoretical arguments that using the average of ratios at the second, prevalence-estimation step often yields lower mean squared error when the random mixing assumption is violated, as seems likely in practice; this estimator was proposed early in NSUM's development but has been largely unexplored and unused. Simulation results using an example network data set also support these findings. On the basis of this theoretical and empirical evidence, we suggest that future surveys using a simple estimator may want to use this mixed estimator, and that estimation methods based on it may produce further improvements. This joint work with Ian Laga, Xiaoyue Niu, and Tyler H. McCormick is published in Sociological Methodology (Kunke et al., 2024). The second project develops a class of scalable spatial stream network (S3N) models for estimation, inference, and prediction with spatial processes on stream networks at a spatial scale that was previously not feasible. Spatial process models are a standard approach to making regional estimates based on point observations, but classically they account only for covariance based on bird's-eye distance, and they do not scale to large regions due to their computational complexity. Existing spatial stream network (SSN) models adapt such spatial processes to river networks by incorporating valid stream covariance functions, but preprocessing and estimation with these models is expensive and precludes the analysis of regions at the multi-state and national level in the United States. Our contribution is a scalable spatial stream network (S3N) model, based on the SSN, that uses nearest-neighbor approximations and more efficient preprocessing to enable national and regional spatial process modeling on stream networks. We demonstrate the accuracy and computational efficiency of S3N models relative to SSNs using simulated data on the Ohio River Basin stream network. This is joint work with Julian Olden and Tyler H. McCormick. The third project applies the S3N models developed in the second project to obtain what is, to our knowledge, the first set of fish population size estimates for over 300 species across the entire Ohio River Basin. Estimation at this scale was previously not possible, and the approach we demonstrate can be used to estimate freshwater fish populations by species over large regions. These estimates represent a critical step for biodiversity monitoring and conservation planning, as the geographic distribution of freshwater fish species at a national scale is currently unknown. Our publicly available code makes national and regional fish population size estimation accessible to the wider research community. This is joint work with Julian Olden and Tyler H. McCormick.
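
In symbols, with $y_{iH}$ the number of people respondent $i$ reports knowing in the hidden subpopulation $H$, $\hat d_i$ the respondent's estimated degree, $n$ respondents, and total population size $N$, the two prevalence-step estimators contrasted above are
\[
\widehat N_H^{\mathrm{RoA}} \;=\; N\, \frac{\sum_{i=1}^{n} y_{iH}}{\sum_{i=1}^{n} \hat d_i}
\;\;\text{(ratio of averages)},
\qquad
\widehat N_H^{\mathrm{AoR}} \;=\; \frac{N}{n} \sum_{i=1}^{n} \frac{y_{iH}}{\hat d_i}
\;\;\text{(average of ratios)}.
\]
The "mixed" estimator recommended above keeps the conventional ratio-of-averages degree step but uses the average of ratios at this second step.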

Bounds and Prediction Intervals for Individual Treatment Effects (2025-08-01)
Zhang, Zhehao; Richardson, Thomas S

This dissertation investigates several problems related to bounds and prediction intervals for the individual treatment effect (ITE). While traditional causal inference has primarily focused on population-level parameters such as the average treatment effect (ATE) and the conditional average treatment effect (CATE), the ITE -- often considered the ideal target for personalized decision-making -- has recently garnered increasing attention. However, the ITE is generally not identifiable from the observed data, even in the context of randomized experiments. As a result, we consider the problem of bounding the ITE using prediction intervals. In particular, when the marginal distributions of the potential outcomes are identifiable from a large, well-conducted randomized experiment, we aim to answer the general question: what constraints exist on the joint distribution of potential outcomes, given these known marginals? Chapters 2 and 3 lay the theoretical foundation for addressing this question. In Chapter 2, we revisit a classical problem posed by Kolmogorov concerning the sharp upper and lower bounds for the cumulative distribution function (cdf) of the sum of two random variables with fixed marginals. Motivated in part by the challenges of bounding individual treatment effects, we focus on the achievability of these bounds. Specifically, we distinguish between bounds that are achievable and those that, although they provide an infimum or supremum and hence cannot be improved, are not attained by any distribution. We contribute new results for the case of discrete random variables, and we also work to clarify, correct, and make more accessible several theorems in the existing literature. In Chapter 3, we apply the insights from Chapter 2 to the difference of two random variables, with an application to individual treatment effects. We identify and address logical gaps in some prior work and illustrate our results through an example. We then connect the problem of characterizing joint distributions with fixed marginals to the theory of couplings of probability measures. We generalize a finite version of Strassen's theorem using a max-flow/min-cut construction, which can be applied to prediction intervals (sets) for the ITE. Finally, we explore a natural extension: bounding the probability mass function (pmf) of the difference of two random variables. In Chapter 4, we build upon the results of the previous chapters and focus on prediction intervals for individual treatment effects. For a binary treatment, we consider all three types of outcomes: binary, ordinal, and continuous. We begin by examining how to construct valid prediction intervals given known marginal distributions. We then address the converse problem: what necessary conditions must hold for a joint distribution of potential outcomes to exist such that a given prediction interval is valid? We discuss scenarios in which certain points must necessarily be included in the interval. Finally, we compare and contrast the ITE with the ATE, highlighting their differing implications for causal inference.
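
For readers unfamiliar with the Chapter 2-3 setting, the classical Makarov-type bounds, written here for the treatment-effect difference $\Delta = Y(1) - Y(0)$ with known marginal cdfs $F_1$ and $F_0$ (the form used in the treatment-effects literature, e.g., Fan and Park, 2010), are
\[
\sup_{y}\, \max\bigl\{F_1(y) - F_0(y - \delta),\, 0\bigr\}
\;\le\; F_\Delta(\delta) \;\le\;
1 + \inf_{y}\, \min\bigl\{F_1(y) - F_0(y - \delta),\, 0\bigr\}.
\]
Whether the outer supremum and infimum are actually attained by some joint distribution is precisely the achievability question the dissertation takes up.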

Survey-Based Methodologies for Enhanced Assessment of Cause of Death (2025-08-01)
Fan, Shuxian; McCormick, Tyler

This dissertation explores several statistical challenges in cause-of-death (COD) assessment from verbal autopsy (VA) surveys: structured interviews with caregivers of the deceased in regions where traditional medical certification is unavailable. Despite their crucial role in mortality surveillance, VA data analysis is complicated by inconsistent age categorization, respondent burden from lengthy questionnaires, and potential biases in automated classification systems. The first project develops a Bayesian framework for reconciling inconsistent age categories across multiple VA data sources. We formulate age-disaggregated death counts as fully classified multinomial data and show that incorporating partially classified aggregated data can produce an improved Bayes estimator under Kullback-Leibler (KL) loss. Under specific theoretical conditions, this approach calibrates data with different age structures to generate unified estimates of standardized age distributions. Through numerical studies and applications to real-world mortality data, we demonstrate the method's effectiveness in imputing incomplete classifications and in guiding appropriate levels of age disaggregation. The second project adopts Bayesian active questionnaire design to optimize VA data collection. Using posterior-weighted KL information criteria and uncertainty-aware stopping rules, this approach sequentially selects questions to maximize information while minimizing respondent burden. Validation with gold-standard VA data shows comparable classification accuracy using substantially fewer questions, with implications for improved data collection efficiency. The third project presents a statistical framework for valid inference using causes of death predicted from VA narratives. By extending prediction-powered inference (PPI) to multinomial classification, we enable unbiased parameter estimation when natural language processing models are used for COD classification. Cross-site validation demonstrates effective correction for transportability errors and highlights the distinction between predictive accuracy and inferential validity. The last project proposes and validates a proof-of-concept Bayesian mixture model for estimating cause-specific mortality with incomplete age stratification. Using age-mixing proportions within a Bayesian framework, this approach shows that incorporating partially observed age data improves estimation compared to discarding incomplete records. Analysis of demographic survey data from multiple countries reveals that the proposed approach generally yields more accurate cause-specific mortality estimates, with performance advantages varying with the actual age distribution of deaths. Together, these methodological innovations address fundamental challenges in survey-based mortality surveillance, with applications extending beyond COD assessment to broader problems of inference with incomplete or predicted data.
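
The starting point for the third project's extension is prediction-powered inference in its simplest, mean-estimation form (Angelopoulos et al., 2023): with a small labeled sample $(X_i, Y_i)_{i=1}^{n}$, a large unlabeled sample $(\tilde X_i)_{i=1}^{N}$, and a predictor $f$,
\[
\hat\theta^{\mathrm{PP}} \;=\; \frac{1}{N}\sum_{i=1}^{N} f(\tilde X_i)
\;-\; \frac{1}{n}\sum_{i=1}^{n}\bigl(f(X_i) - Y_i\bigr),
\]
where the second term (the "rectifier") corrects the bias of the predictions, so validity does not hinge on the model being accurate. The dissertation extends this logic to multinomial COD classification.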

Statistical estimation and decision-making for the COVID-19 pandemic (2025-01-23)
Irons, Nicholas; Raftery, Adrian E; Cinelli, Carlos

This dissertation aims to provide policymakers and health practitioners with statistical tools and actionable information with which to make informed decisions, with a particular focus on the response to infectious disease outbreaks. In the first project, we quantify how many Americans contracted COVID-19 in the first year of the pandemic. We formulate a Bayesian epidemiological model utilizing multiple sources of information, including random sample testing surveys, to debias clinical COVID data and estimate SARS-CoV-2 prevalence and transmission rates in the United States through March 2021. We quantify the extent to which reported COVID cases underestimated true infection counts, which was substantial (about 2 in 3 infections were missed by testing), especially in the first months. Building on this work, in the second project we determine how to respond optimally to pandemics using non-pharmaceutical interventions (NPIs), which include social distancing measures, school and workplace closures, and testing, tracing, and masking policies. We first estimate the effects of NPIs on SARS-CoV-2 transmission in the US. Coupling these results with estimates of the costs associated with infections and NPIs derived from the public health and economics literature, we evaluate the cost-effectiveness of NPI policies in the year before the arrival of COVID vaccines and antiviral treatments. Going further, we frame the problem of policy design in terms of statistical decision theory, from which we derive optimal NPI strategies. We find that pandemic school closures were not cost-effective, but other measures were. In the third project, we propose a new method for the comparison of proportions, a foundational and ubiquitous statistical inference task relevant, in particular, to the analysis of randomized controlled trials with a binary outcome. Framing the problem as one of causal inference, we demonstrate how the likelihood can be cast in terms of clinically meaningful quantities, which facilitates interpretation, sensitivity analysis, and prior specification, and addresses the deficits of existing approaches. We demonstrate the utility of our method in empirical examples, including a reanalysis of the Pfizer-BioNTech COVID-19 vaccine trial, which proved safe and highly efficacious in preventing SARS-CoV-2 infection.

Probabilistic Models for Human Migration Forecasting and Residency Imputation (2025-01-23)
Welch, Nathan G; Raftery, Adrian E

I develop probabilistic models to enhance the estimation and forecasting of human migration flows and residency. Using a Bayesian hierarchical approach, I first propose a model for forecasting global bilateral migration flows among the 200 most populous countries, producing well-calibrated projections that reduce error rates compared to existing methods. This model is integrated into a population projection framework to forecast migration flows by age and sex, providing the first probabilistic forecasts of international bilateral migration flows through 2045. Next, I address the influence of age structure on much longer-term migration forecasts by introducing the Migration Age Structure Index (MASI), which adjusts net migration rates, offering narrower prediction intervals and more accurate projections of population change, especially for aging populations. Finally, I improve the Person-Place Model (PPM), a key tool used by countries without population registers for census and intercensal population estimation, by developing the Bayes PPM, a Bayesian hierarchical model that refines residency estimates from administrative records. This model eliminates the crude approximations to a well-defined statistical model that are currently in use, enhancing the accuracy of uncertainty intervals in demographic estimates. Collectively, these contributions offer more capable tools for forecasting migration and demographic changes, supporting policymakers in navigating complex global migration dynamics.

Scalable statistical methods for microbial metagenomics (2025-01-23)
Teichman, Sarah; Willis, Amy

Scientific interest in microbiomes (communities of microscopic organisms in a given environment) has recently expanded due to the growing understanding of the role of the microbiome in human and environmental health, in conjunction with the decreasing costs of metagenomic sequencing. However, several complications of the data observed from sequencing microbial samples preclude the use of off-the-shelf statistical methods. There is therefore high demand for statistical methods that are tailored to address scientific questions about microbiomes while accounting for relevant features of how the data are collected and processed. These methods must also be feasible and computationally efficient at the large scale of data that metagenomic sequencing produces. In my first project, I present a visualization method to compare estimated gene-level evolutionary histories to estimated genome-level evolutionary histories. Evolutionary histories are best represented by phylogenetic trees: complex graph objects made up of nodes, which represent biological categories referred to as taxa, and edges, which represent the evolutionary relationships between taxa. I use a local linear approximation of phylogenetic tree space to visualize estimated gene trees as points in a low-dimensional Euclidean space. I demonstrate the utility of my proposed visualization approach through two microbial data analyses. This visualization approach scales to large sets of gene trees that encode a large number of taxa. Next, I present another computationally scalable method for the analysis of metagenomic sequencing data. I extend the method of Clausen and Willis for taxonomic differential abundance analysis to make it computationally efficient for datasets with thousands of taxa. Through simulation, I demonstrate that my scalable method achieves Type I error rate control and power similar to the original method, and through data analyses I demonstrate that the two methods lead to very similar differential abundance conclusions. The differential abundance estimand in my method is defined with respect to a small set of reference taxa, and I suggest several approaches to choosing such a set and investigate how these approaches affect estimates and inference results through simulation and in a small data analysis. In my third project, I consider differential abundance analyses of molecular functions. I propose a novel functional abundance model and show that, in this model, the identifiable differential abundance parameter is a function of both biological parameters and unknown sequencing effects. I develop a framework to simulate data under my functional abundance model, and use this framework to study how sequencing effects of different magnitudes affect estimation and inference for these differential abundance parameters, relative to the true biological fold-differences in abundance that are scientifically relevant. In these simulations, I find that inference on the identifiable differential abundance parameter cannot reliably be used to draw conclusions about biological fold-differences in abundance, especially in the presence of sequencing effects with large magnitudes. To address this, I suggest careful interpretation of results from the differential abundance analysis of functional data in terms of a parameter that combines biological signal with sequencing artifacts. As a whole, this dissertation presents three methods that address complex scientific questions with applications to microbiome science, each of which accounts for the effects of sequencing on microbiome data and is computationally efficient at the scale of a typical metagenomic dataset.
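
As a schematic of what "defined with respect to a small set of reference taxa" can mean (an illustrative form, not necessarily the exact estimand of the Clausen and Willis method): with $\mu_j^{(a)}$ the mean abundance of taxon $j$ in condition $a$, one identifiable parameter is
\[
\beta_j \;=\; \log\frac{\mu_j^{(1)}}{\mu_j^{(0)}} \;-\; \frac{1}{|R|}\sum_{r \in R}\log\frac{\mu_r^{(1)}}{\mu_r^{(0)}},
\]
the log fold-difference of taxon $j$ measured relative to the average log fold-difference of a reference set $R$. Written this way, it is clear why the choice of $R$ shifts every estimate by a common offset, which is why the choice of reference set deserves the scrutiny it receives above.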

Statistical Inference with Missing and Latent Data: Methods for Data Harmonization, Network Curvature Estimation and Experimentation Under Interference (2024-10-16)
Wilkins-Reeves, Steven; Chen, Yen-Chi; McCormick, Tyler

This dissertation explores several statistical challenges involving inference problems where the object of interest is a latent phenomenon or involves missing data. Effective modeling of the latent processes or missing data is crucial for accurate inference in such scenarios. We delve into issues of missing and latent data across three distinct settings. The first project addresses missing outcomes resulting from changes in neuropsychological test battery versions, where each version represents a different testing model and scale. The second project focuses on inference for causal parameters using partially measured network data, also highlighting the experimental design challenges associated with such problems. The final project presents a nonparametric method for estimating network curvature from distance matrices. This approach emphasizes network models and introduces tests for constant curvature, providing a clearer understanding of the underlying network structure.

Problems in Identification and Estimation: Algorithms for Pathogen, Ancestral, and Rashomon Analysis (2024-10-16)
Venkateswaran, Aparajithan; McCormick, Tyler H; Perković, Emilija

This dissertation answers three questions on identifiability and estimability that arise in policy-making and causal discovery. First, we study contact tracing as a tool to prevent the spread of infectious diseases. We show how to substantially improve the efficiency of contact tracing using multi-armed bandits that leverage heterogeneity in how infectious a sick person is. We propose testing the contacts of infected persons to ascertain whether each is likely to be a "high infector", and searching for additional infections only when doing so is likely to be highly fruitful. Using administrative COVID-19 contact tracing datasets, we show that an easily implementable strategy performs at nearly optimal levels in the field. Second, we robustly estimate heterogeneities in an outcome of interest with respect to a factorial feature space. We partition this factorial space into "pools" of feature combinations such that the outcome differs only across pools. We fully enumerate the Rashomon Partition Set (RPS), the collection of all partitions with sufficiently high posterior density. Using the L0 prior, which we show is minimax optimal, we calculate the approximation error relative to the entire posterior and bound the size of the RPS. In three empirical settings (charitable giving, chromosomal structure, and microfinance), we highlight robust conclusions, including affirmations and reversals of findings in the extant literature. Third, we restrict Markov equivalence classes of causal maximal ancestral graphs (MAGs) to those that agree with expert knowledge in the form of edge orientations. We can uniquely represent this equivalence class by its essential graph. We revise two previously described graphical orientation rules and present a novel rule for adding expert knowledge. We provide an algorithm for adding expert knowledge and show that it is complete for edge marks in the circle component of the essential graph. We also provide an algorithm for verifying completeness in the general case.
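
A toy Python sketch of the screening idea in the first project (illustrative assumptions throughout: the function, its parameters, and the fully observed infection labels are simplifications, and the dissertation's bandit algorithms are adaptive and operate under uncertainty):

    def trace(index_cases, contacts_of, budget, pilot=3, threshold=1):
        """contacts_of maps a case to a list of (contact, is_infected).
        Greedily spends a limited test budget, expanding beyond a small
        pilot group only when early positives suggest a high infector."""
        found = set()
        for case in index_cases:
            hits = 0
            for j, (contact, infected) in enumerate(contacts_of.get(case, [])):
                if j >= pilot and hits < threshold:
                    break  # pilot group looked unremarkable; save the budget
                if budget == 0:
                    return found
                budget -= 1
                if infected:
                    found.add(contact)
                    hits += 1
        return found

The point of the heuristic is budget reallocation: most index cases get only a cheap pilot test of a few contacts, while the remaining tests concentrate on the apparent high infectors.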

Statistical Inference Using Identity-by-Descent Segments: Perspectives on Recent Positive Selection (2024-10-16)
Temple, Seth David; Browning, Sharon R

Positive selection is suggested to be the primary mechanism of phenotypic adaptation. Selective sweeps are one model of positive selection in which beneficial mutations increase in frequency. Many existing methods to detect positive selection do not adjust for multiple hypothesis tests, and many approaches to estimating the selection coefficient, a parameter that governs the rate of allele frequency change, lack uncertainty quantification. Here we develop theory and methodology to study recent positive selection with genetic data from the present day. Our methods use long identity-by-descent segments, which should be unusually abundant in strong and recent selective sweeps. In our first project, we prove that the rate of detectable identity-by-descent segments around a locus is normally distributed for large sample size and large scaled population size. In our second project, we propose an estimator of the selection coefficient, with confidence intervals, that is an easy-to-interpret, one-to-one, non-decreasing function of the identity-by-descent rate. Furthermore, we provide methods to analyze selective sweeps regardless of whether the selected allele is known or genotyped. In our third project, we derive a multiple testing correction to control the family-wise error rate when scanning for excess identity-by-descent rates. We apply our suite of methods to detect and model selective sweeps in European, African, and South Asian human populations.
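
A useful mental model for the sweeps studied here is the deterministic approximation for the frequency $p(t)$ of an additively selected allele with selection coefficient $s$ (a standard population-genetics formula, stated for orientation rather than as the dissertation's model):
\[
\frac{dp}{dt} = s\,p(1 - p)
\quad\Longrightarrow\quad
p(t) = \frac{p_0\, e^{st}}{1 - p_0 + p_0\, e^{st}}.
\]
Larger $s$ means a faster rise in frequency, hence more recent common ancestry at the locus, and therefore longer and more abundant identity-by-descent segments, which is the signal the methods above exploit.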

Statistical Learning and Modeling with Graphs and Networks (2024-09-09)
Wei, Zeyu; Chen, Yen-Chi; McCormick, Tyler Harris

A graph, consisting of a set of vertices and a set of edges, is a natural tool for studying relations. From a geometric perspective, relations between data points reveal information about the underlying structure, and a graph, as a geometric object, can not only visualize but also mathematically characterize such geometric structures in the data. From a network perspective, graphs can also model connections between different units, with applications in fields such as epidemiology, econometrics, sociology, biology, and astronomy. We first take advantage of graphs from the geometric perspective and propose a data analysis framework that constructs weighted graphs, called skeletons, to encode the geometric structures in the data, and we use the learned graphs to assist downstream analysis tasks such as clustering and regression. For clustering, we introduce a density-aided method that can detect clusters with irregular shapes in multivariate and even high-dimensional data. To bypass the curse of dimensionality, we propose surrogate density measures that are less dependent on the dimension and have intuitive geometric interpretations. The clustering framework constructs a concise graph representation of the given data as an intermediate step and can be thought of as a combination of prototype methods, density-based clustering, and hierarchical clustering. We show through theoretical analysis and empirical studies that skeleton clustering produces reliable clusters in multivariate and high-dimensional scenarios. For regression tasks, we propose a novel framework specialized for covariates concentrated around low-dimensional geometric structures. The proposed framework first learns a graph representation of the covariates that encodes the geometric structures; we then apply nonparametric regression techniques to estimate the regression function on the skeleton graph, which, notably, bypasses the curse of dimensionality. We derive statistical and computational properties of the proposed regression framework and use simulations and real data examples to illustrate its effectiveness. Our framework has the advantage that predictors for distinct geometric structures can be accounted for, and it is robust to additive noise and noisy observations. Graphs are also widely used to represent networks of connections and serve as a helpful tool in modeling real-world diffusion processes. Network diffusion models are used to study phenomena such as disease transmission, information spread, and technology adoption. However, small amounts of mismeasurement are extremely likely in the networks constructed to operationalize these models, and we show that estimates of diffusions are highly non-robust to this measurement error. First, we show that even when measurement error is vanishingly small, such that the share of missed links is close to zero, forecasts about the extent of diffusion will greatly underestimate the truth. Second, a small mismeasurement in the identity of the initial seed generates a large shift in the locations of the expected diffusion path. We show that both of these results still hold when the vanishing measurement error is only local in nature. Such non-robustness in forecasting exists even under conditions where the basic reproductive number is consistently estimable. Possible solutions, such as estimating the measurement error or implementing widespread detection efforts, still face difficulties because the number of missed links is so small. Finally, we conduct Monte Carlo simulations on simulated networks and on real networks from three settings: travel data from the COVID-19 pandemic in the western US, a mobile phone marketing campaign in rural India, and an insurance experiment in China.
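
The fragility result has a simple intuition: diffusion forecasts depend on connectivity, and a handful of missed links can change connectivity qualitatively. A toy Python sketch (illustrative only; the dissertation's analysis concerns stochastic diffusion models and vanishing error rates):

    from collections import deque

    def reachable_within(adj, seed, steps):
        """Nodes reached from `seed` in at most `steps` hops, i.e. an
        SI-style diffusion in which every contact transmits."""
        dist = {seed: 0}
        q = deque([seed])
        while q:
            u = q.popleft()
            if dist[u] == steps:
                continue
            for v in adj.get(u, ()):
                if v not in dist:
                    dist[v] = dist[u] + 1
                    q.append(v)
        return set(dist)

    # A path network with and without one recorded bridge edge (2-3):
    adj_true = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
    adj_miss = {0: [1], 1: [0, 2], 2: [1], 3: [4], 4: [3]}
    print(len(reachable_within(adj_true, 0, 10)))  # 5
    print(len(reachable_within(adj_miss, 0, 10)))  # 3

Here a single unrecorded bridge edge cuts the forecast of reached nodes from five to three, even though only one link is missing.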

Estimation and Inference of Optimal Policies (2024-09-09)
Li, Zhaoqi; Luedtke, Alex AL; Jain, Lalit LJ

Many fields conduct experiments to learn policies that map individual characteristics to actions, with the policies achieving the best outcomes referred to as optimal policies. Because obtaining human feedback from experiments is expensive, we are often interested in learning the optimal policy as quickly as possible. However, several challenges arise in developing practical approaches to policy learning. First, traditional methods usually guarantee only minimax optimality, while practitioners care more about performance on their particular problem instance, so a better notion of optimality than the worst case is needed. Second, existing optimal methods are generally hard to implement at large scale, making deployment challenging for large companies. Third, real-world settings often involve multiple performance metrics of interest, such as mitigating side effects while ensuring good disease recovery in the biomedical sciences, or balancing short-term acquisition with long-term retention in digital marketing. This dissertation tackles these challenges and provides several practical approaches to policy learning from various perspectives. To identify the optimal policy as quickly as possible, we frame policy learning as a pure exploration problem in bandits and develop algorithms that provably identify the optimal policy quickly for every problem instance, a property we refer to as instance optimality. We first focus on the stochastic contextual bandit problem in the PAC setting: given a policy class, the goal is to return a policy whose expected reward is near the optimal reward with high probability. We characterize the first instance-dependent PAC sample complexity of contextual bandits. We propose a new computationally efficient algorithm that achieves this sample complexity using only a polynomial number of calls to an argmax oracle. We then delve into the challenge of computational efficiency, focusing on algorithms that are easily implementable at large scale. In the linear bandit setting, where we aim to return the arm with the largest reward given a set of arms and an unknown parameter vector, we introduce an algorithm that leverages only the oracles required by the widely used Thompson sampling algorithm, namely sampling and argmax oracles, and achieves an asymptotically optimal exponential convergence rate. In addition, we demonstrate that our algorithm is easy to implement and performs empirically as well as existing optimal methods. We also explore the impact of the optimal policy on additional metrics when multiple objectives are of interest. We propose a novel margin condition that restricts how a subsidiary metric behaves for nearly optimal policies. Under this condition, we provide an efficient estimator for evaluating subsidiary metrics under a policy that is optimal for the primary one. Additionally, we introduce two alternative two-stage strategies that do not require a margin condition. Both methods first construct a set of candidate policies and then build a confidence interval over this set. We provide numerical simulations to assess the performance of these methods in various scenarios.
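
For readers who want a concrete baseline for pure exploration, a generic successive-elimination scheme for best-arm identification is sketched below (a textbook-style illustration; the dissertation's instance-optimal and oracle-efficient algorithms are substantially more refined):

    import math

    def best_arm(pull, n_arms, delta=0.05, max_rounds=10_000):
        """pull(a) returns a stochastic reward in [0, 1] for arm a."""
        active = list(range(n_arms))
        means = [0.0] * n_arms
        counts = [0] * n_arms
        for t in range(1, max_rounds + 1):
            for a in active:  # sample every surviving arm once per round
                counts[a] += 1
                means[a] += (pull(a) - means[a]) / counts[a]
            # anytime-valid confidence radius for means of [0, 1] rewards
            rad = math.sqrt(math.log(4 * n_arms * t * t / delta) / (2 * t))
            best = max(means[a] for a in active)
            active = [a for a in active if means[a] + rad >= best - rad]
            if len(active) == 1:
                return active[0]
        return max(active, key=lambda a: means[a])

Each round discards arms whose upper confidence bound falls below the best lower confidence bound; the instance-dependent gaps between arm means determine how quickly elimination happens, which is exactly the quantity instance-optimal methods adapt to.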

Statistical Methods for the Analysis and Prediction of Hierarchical Time Series Data with Applications to Demography (2024-02-12)
Liu, Daphne Hong-Hsiao; Raftery, Adrian E

This dissertation develops new methods for the analysis and prediction of hierarchical time series data, with a focus on applications to demography. The first two projects aim to estimate and project the potential effect that increases in education and access to family planning have on fertility decline in high-fertility countries. We first propose a new framework, inspired by Granger causality, for identifying the potential accelerating effect of education and family planning on fertility decline. We identify the mechanisms by which increases in education and access to family planning could lead to declines in fertility beyond what we would already expect based on past trends in fertility. We estimate the direct and indirect effects of education and family planning on fertility decline and explore how these effects differ within sub-Saharan Africa compared to other regions of the world. We build upon this work in the second project to propose a new method for conditional probabilistic projections of fertility given specific policy intervention outcomes targeting education and access to family planning. We develop a conditional Bayesian hierarchical model that creates conditional probabilistic projections of the Total Fertility Rate (TFR) given probabilistic projections of women's educational attainment, contraceptive prevalence of modern contraceptive methods, and GDP per capita. The conditional projection model enables the creation of projections corresponding to different policy intervention scenarios targeting educational attainment and contraceptive prevalence. We illustrate the conditional projection model with a range of policy intervention scenarios corresponding to meeting the United Nations Sustainable Development Goals of universal secondary education and universal access to family planning by 2030. In the third project, we are motivated by the problem of missing data in a secondary school enrollment data set with two nonlinearly related measures of enrollment rates that have differing amounts of missing data. We propose a new method for multiple imputation of hierarchical nonlinear time series data that uses a sequential decomposition of the joint distribution and incorporates smoothing splines to account for nonlinear relationships between variables. Using a simulation study and an application to the school enrollment data, we show that the proposed method leads to substantial improvements in performance for estimating parameters in uncongenial analysis models and for predicting individual missing values, compared to commonly used methods for multiple imputation of hierarchical time series data.

Exponential Family Models for Rich Preference Ranking Data (2023-09-27)
Wagner, Annelise; Meilă, Marina

Preferences can be found in a wide array of contexts, from recommender systems to opinion polls, consumer habits, and elections. The specific method of data collection and the types of data collected greatly affect the tools available for analysis. We seek to expand the class of exponential family ranking models by considering two types of richer preference data. We first look at the Recursive Inversion Model, a highly flexible exponential ranking model that can reflect high-level trends in ranking data with informative parameters for inference. We extend these models to partial rankings, which more accurately reflect the true opinions of most individuals by allowing for non-strict orderings of preference. While the addition of partial rankings increases the algorithmic and computational complexity of maximum likelihood estimation, we detail methods and algorithms that ensure tractability. We also utilize this theory to provide algorithms for calculating conditional and marginal probabilities for the Recursive Inversion Model. Using this new theory, we demonstrate the usefulness of expressing ratings as rankings, highlighting a novel method of analysis for preference data expressed as ratings. We expand on this further by proposing a new data structure, rankings with landmarks, which combines the relative and absolute preferences expressed in rankings and ratings into one. This new class of rankings requires the construction of new ranking models, of which the Landmark Generalized Mallows Model (L-GMM) appears the most promising. We detail algorithms for maximum likelihood estimation of the L-GMM, providing a solution to creating exponential ranking models containing non-invertible subsets, and demonstrate them on real-world data.
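
For orientation, the Generalized Mallows family underlying the L-GMM (and closely related to the Recursive Inversion Model) places an exponential-family distribution over permutations $\pi$ around a central ranking $\sigma_0$:
\[
P(\pi \mid \theta, \sigma_0) \;=\; \frac{1}{\psi(\theta)}
\exp\Bigl(-\sum_{j=1}^{n-1} \theta_j\, V_j(\pi\,\sigma_0^{-1})\Bigr),
\]
where $V_j$ counts the inversions at stage $j$ of an insertion-sort decomposition of the permutation and $\psi(\theta)$ is the normalizing constant. The landmark models above augment such relative-order models with the absolute preference information carried by ratings.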
