Biostatistics

Permanent URI for this collection: https://digital.lib.washington.edu/handle/1773/4900

Recent Submissions

Now showing 1 - 20 of 242
  • Causal Inference in the Presence of Unmeasured Confounding: Advances in Mendelian Randomization and Proximal Causal Inference
    (2026-04-20) Wu, Yinxiang; Ye, Ting; Rotnitzky, Andrea
    Observational data are indispensable for causal inference, particularly when randomized controlled trials are infeasible due to ethical, logistical, or economic constraints. However, observational data are subject to unmeasured confounding, where unobserved variables influence both the treatment and the outcome, potentially leading to biased estimates and spurious findings. This dissertation aims to develop statistical methods for causal inference in the presence of unmeasured confounding, focusing on multivariable Mendelian randomization (MVMR) using summary-level data and proximal causal inference using individual-level data. MVMR uses genetic variants as instrumental variables (IVs) to infer the direct causal effects of multiple exposures on an outcome. In the first project, we develop a general asymptotic regime for many weak instruments, which allows for varying degrees of IV strengths across exposures, offering a more accurate asymptotic framework for studying MVMR estimators. We then propose a novel spectral regularized inverse-variance weighted estimator for estimating causal effects, and show that it is consistent and asymptotically normal under many weak IVs. The second project extends this work to settings involving many potentially highly correlated exposures. We develop a new estimator which minimizes a penalized debiased objective function that reduces weak instrument bias while yielding interpretable estimates with theoretical guarantees for variable selection. To enable valid post-selection inference, we adapt a data-thinning strategy to summary-data MVMR. In the third project, we aim to identify and estimate the average treatment effect (ATE) in a target population using data from an observational source study conducted within another population. 
To address unmeasured confounding within the source study and unmeasured effect modification, we adopt proximal causal inference, leveraging observed variables as proxies for unmeasured confounders and effect modifiers.
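The inverse-variance weighted (IVW) estimator that the first project regularizes has a simple summary-data form: regress the SNP-outcome associations on the SNP-exposure associations, weighting by the precision of the outcome associations. A minimal sketch of the standard MVMR-IVW estimator (not the spectral regularized estimator proposed in the dissertation), with simulated summary statistics:

```python
import numpy as np

def mvmr_ivw(G, gamma, se_gamma):
    """Standard multivariable IVW estimator from summary data:
    weighted least squares of SNP-outcome effects (gamma) on
    SNP-exposure effects (G, one column per exposure), with
    weights 1 / se_gamma**2."""
    w = 1.0 / se_gamma**2
    A = G.T @ (w[:, None] * G)
    b = G.T @ (w * gamma)
    return np.linalg.solve(A, b)

# Simulated summary statistics: 500 SNPs, 2 exposures with direct
# effects 0.5 and -0.3 (all values hypothetical).
rng = np.random.default_rng(0)
G = rng.normal(0.0, 0.2, size=(500, 2))
beta = np.array([0.5, -0.3])
se = np.full(500, 0.05)
gamma = G @ beta + rng.normal(0.0, 1.0, 500) * se
est = mvmr_ivw(G, gamma, se)
```

With many strong instruments this recovers the direct effects well; the dissertation's concern is precisely the regime where instruments are many and weak, in which this plain IVW estimator becomes biased.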
  • Statistical Methods for Assessing COVID-19 Vaccine Effectiveness and Immune Correlates in Test-Negative Design Studies
    (2026-02-05) Andrews, Leah Irene Binkley; Gilbert, Peter
Vaccines have been essential for protecting against COVID-19 and must be continually evaluated and updated in post-marketing settings. Understanding the relationship between COVID-19 and immune markers correlated with vaccination or infection can also inform vaccine development and updates. The test-negative design (TND) is a resource-efficient observational study design that enrolls symptomatic individuals who obtain SARS-CoV-2 testing and compares vaccination status or immune marker measures between cases who test positive and noncases who test negative. While the TND reduces bias from healthcare-seeking behavior, additional biases from confounding, missing data, and selection mechanisms may persist. This dissertation proposes to improve existing TND analysis methods to obtain more robust and interpretable causal estimates of COVID-19 vaccine effectiveness and protective immune correlate levels. The first project extends a targeted maximum likelihood estimator under a partially linear logistic regression model to a TND setting. This semiparametric logistic regression method targets a causal conditional risk ratio of COVID-19 in the healthcare-seeking population and allows for flexible, data-driven confounding adjustment and missing data in the exposure variable. The second project investigates conditions that allow the TND to obtain unbiased and precise COVID-19 vaccine effectiveness estimates. This work reanalyzes five phase 3 COVID-19 Prevention Network vaccine efficacy trials as TND studies and evaluates whether COVID-19 vaccines affect other causes of COVID-19-like symptoms. The third project extends a negative control method that addresses unmeasured confounding and selection bias in TND studies. This extension targets the causal risk ratios of viral genotype-specific symptomatic COVID-19 and incorporates inverse probability weighting and augmented inverse probability weighting to account for COVID-19 cases with missing viral genotypes.
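In its most basic form, the TND estimates vaccine effectiveness as one minus the odds ratio comparing vaccination between test-positives and test-negatives. A toy sketch with hypothetical counts (not the semiparametric estimators developed in this dissertation):

```python
def tnd_ve(vacc_pos, unvacc_pos, vacc_neg, unvacc_neg):
    """Crude test-negative-design vaccine effectiveness:
    VE = 1 - OR, where OR compares the odds of vaccination in
    test-positive cases vs test-negative noncases."""
    odds_ratio = (vacc_pos / unvacc_pos) / (vacc_neg / unvacc_neg)
    return 1.0 - odds_ratio

# Hypothetical counts among symptomatic, tested individuals:
# 30 vaccinated and 120 unvaccinated test-positives;
# 200 vaccinated and 160 unvaccinated test-negatives.
ve = tnd_ve(30, 120, 200, 160)  # OR = (30/120)/(200/160) = 0.2, so VE = 0.8
```

The projects above refine exactly this crude contrast: with confounding adjustment, missing exposures, and selection mechanisms, the simple odds ratio no longer has a clean causal interpretation.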
  • Statistical Methods for Infectious Disease Prevention Studies with Varying Exposure Risk
    (2026-02-05) Dahl, Angela; Brown, Elizabeth
    In infectious disease prevention studies, participants' risk of infection can vary greatly due to differences in their underlying risk of being exposed to infection, which can complicate assessments of the association between an intervention and the risk of infection. For many infectious diseases with seasonal or otherwise temporal epidemic patterns, the risk of exposure to infection can vary dramatically with changes in the scale of the epidemic in the community, causing infection risk to be dependent on calendar time. In Chapters 2 and 3 of this dissertation, we provide statistical methods to account for differences in the risk of exposure to infection in time-to-event studies due to temporal epidemic patterns. Additionally, for diseases such as human immunodeficiency virus (HIV) and other sexually transmitted infections (STIs), variations in behavior can drive differences in participants' risk of exposure to infection but are difficult to capture with covariates. In Chapter 4, we identify settings in which unobservable heterogeneity in risk of exposure can cause bias in efficacy estimates and, in some cases, false effect modification, with a particular focus on nested case-control studies.
  • Continuous Exposures and Inverse Problems in Causal Inference
    (2026-02-05) Hemmady, Anand; Rotnitzky, Andrea; Carone, Marco
    This dissertation studies challenges that arise in causal inference with continuous exposures, with particular emphasis on the role of ill-posed inverse problems. Common causal estimands in continuous exposure settings are often difficult to interpret, challenging to estimate, or rely on strong and potentially unrealistic identification assumptions. The first project introduces and studies a class of stochastic interventions for continuous exposures that yield scientifically interpretable causal estimands which can be identified from observed data without reliance on the positivity assumption. We establish conditions for identification and propose and study an influence function-based estimator. The estimator’s performance is examined in simulations for both uncensored and right-censored outcomes, and the method is applied to the study of correlates of protection in an HIV vaccine trial. The second project considers a two-sample instrumental variable framework for causal inference when the exposure is observed with error. The causal estimand is formulated as a functional of a solution of an ill-posed integral equation, thus connecting the problem to recent work on statistical inverse problems. An estimating equations-based estimator is proposed, its asymptotic properties are studied, and its finite-sample performance is evaluated through simulations. The method is applied to data from the COVAIL study. The third project scrutinizes common assumptions used in the analysis of statistical inverse problems, which can be difficult to interpret in causal inference settings. These assumptions are explored using tools from microlocal and harmonic analysis, providing further insight into these assumptions and suggesting avenues for future work.
  • Statistical Methods for Excess Mortality Estimation with Variable Data Availability and Completeness
    (2026-02-05) Knutson, Victoria; Wakefield, Jon
    Reliable estimation of mortality is fundamental to demographic measurement and public health assessment. Excess mortality, the difference between observed and expected deaths, provides a comprehensive indicator of the total mortality impact of crises, encompassing both direct and indirect effects. The estimation of excess mortality across diverse countries and population subgroups, as well as the evaluation of the completeness of underlying death registration systems, presents significant methodological challenges. This dissertation develops statistical frameworks to improve the estimation, disaggregation, and validation of mortality measures in settings with incomplete or heterogeneous data, with a particular focus on the global impacts of the COVID-19 pandemic. The outline of the dissertation is as follows. In Chapter 2, we develop a model for estimating global and country-specific excess mortality during the COVID-19 pandemic using an overdispersed Poisson count framework. The approach jointly models total and expected deaths, incorporating both long-term trends and short-term seasonal variation, and extends estimation to countries without complete data through predictive log-linear and multinomial subnational models. These methods formed the basis for the World Health Organization’s global estimates of pandemic excess deaths for 194 countries. In Chapter 3, we extend this framework to the estimation of age- and sex-specific excess mortality. Expected death rates are modeled using an overdispersed Poisson regression with log-linear temporal trends and smooth age effects, while unobserved mortality rates are estimated through a reduced-dimensionality framework that leverages principal components analysis and country-level covariates. This chapter also investigates the sensitivity of excess mortality estimates to the specification of expected deaths and the choice of reference period. 
In Chapter 4, we address the problem of assessing the completeness of vital registration systems, particularly in low- and middle-income countries, by developing a probabilistic formulation of death distribution methods. This approach embeds the demographic balance equations relating deaths, births, and migration within a statistical framework that permits uncertainty quantification. We conclude with a discussion of future directions for this research. Together, these chapters contribute to a unified statistical foundation for mortality estimation, enhancing the accuracy, transparency, and interpretability of vital statistics across global contexts.
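Stripped of the overdispersed Poisson machinery and the seasonal terms, the core excess-mortality computation reduces to extrapolating expected deaths from a pre-crisis trend and differencing. A crude sketch with hypothetical annual counts (the chapters above replace each step with a full statistical model):

```python
import numpy as np

def excess_deaths(years, deaths, crisis_year):
    """Excess = observed - expected, with expected deaths extrapolated
    from a log-linear trend fit to pre-crisis years (a crude stand-in
    for the overdispersed Poisson model with seasonal terms)."""
    years = np.asarray(years, float)
    deaths = np.asarray(deaths, float)
    pre = years < crisis_year
    # log-linear trend: log(deaths) = a + b * year, fit by least squares
    b, a = np.polyfit(years[pre], np.log(deaths[pre]), 1)
    expected = np.exp(a + b * years)
    return deaths - expected

# Hypothetical annual death counts with a jump in 2020.
yrs = np.arange(2015, 2021)
obs = np.array([100000, 101000, 102000, 103100, 104100, 125000])
ex = excess_deaths(yrs, obs, 2020)
```

In practice the choice of reference period and expected-death specification drives the estimates, which is exactly the sensitivity Chapter 3 investigates.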
  • Statistical Methods to Estimate Evolutionary and Technical Parameters Using Whole Genome Sequence Data
    (2025-10-02) Masaki, Nobuaki; Browning, Sharon R
    Whole genome sequence data are widely used in humans and other species to reveal evolutionary patterns and recent demographic history. In this dissertation, we introduce three new statistical methods that can be used to estimate technical parameters such as genotype error rates, as well as parameters related to genome evolution and recent demographic history, using whole genome sequence data from humans and SARS-CoV-2. In our first method, we propose a model that calculates the likelihood of observed parent-offspring trio genotypes, adjusting for both genotype errors and uncalled deletions. We fit our model to SNVs in 77 White British trios identified in the UK Biobank whole genome sequence data, obtaining estimates for the genotype error and uncalled deletion rates in this dataset. In our second method, we formulate a model to estimate the mean length of gene conversion tracts. Our model uses a separate per-site allele conversion rate for each observed tract. We fit this model to gene conversion tracts detected from the UK Biobank whole autosome sequence data and infer the mean length of gene conversion tracts in humans. Finally, in our third method, we propose a hidden Markov model that accounts for mutations and genotype errors to detect recombinant SARS-CoV-2 sequences.
  • Methods for time series network analysis
    (2025-10-02) Hellstern, Michael; Shojaie, Ali
Statistical networks can encode arbitrary relationships between variables in a system. Due to this flexibility, scientific hypotheses about interactions between variables can typically be formulated as a statistical network analysis. In addition to analyzing static networks, studying how statistical networks change in response to experimental or environmental conditions is often of scientific interest. A network is typically defined as a set of vertices and edges. Specifically, a network or graph, G, can be written as G = (V, E) where V = {1,...,k} are the vertices or variables and E is the edge set that encodes the relationship between variables. A common example of a statistical network is the correlation matrix, where an edge represents the correlation between variables. While analysis of networks and their changes is ubiquitous across many domains, our work is motivated specifically by applications in which networks are derived from time series data. In contrast to independent data, statistical analysis of time series data is complicated by the inherent serial correlation. In practice, the degree of this correlation is unknown and network analysis methods that can flexibly handle varying degrees of dependence are needed. We approach this problem from two angles. The first angle, used in the first two portions of this thesis, focuses on developing methods with minimal assumptions on temporal dependence. In the third portion of this thesis we approach this problem from the second angle, which attempts to leverage the flexibility of deep learning methods to analyze statistical networks. In the first chapter, we propose a novel order selection method in vector autoregressive (VAR) models. Order selection is an essential step in fitting VAR models and while many order selection methods exist, each comes with weaknesses.
Our proposed order selection method is based on the observation that the expected squared error loss is flat once the fitted order reaches or exceeds the true order. We show that under mild assumptions on the underlying process our new order selection method consistently estimates the true order. Motivated by applications in neuroscience, the second chapter of this thesis develops a novel estimation and inference procedure for a difference in the inverse spectral densities. In neuroscience, it is often of interest to study how brain networks change in response to electrical stimulation with the hopes of developing stimulation-based treatments for neurodegenerative diseases. Furthermore, it is essential to study networks in the frequency domain as higher frequencies contain key brain connectivity information. With this in mind, we develop methods to directly estimate and perform statistical inference on a difference in inverse spectral densities. Crucially, our method relies on minimal assumptions and can flexibly handle a large range of data dependence. The last chapter of this thesis proposes a new deep learning-based change-point detection framework. The core idea behind this method is a continuous approximation of the indicator function. With this approximation, change-points can be specified as parameters of a deep learning model. Thus, change-points and model parameters can be jointly learned using stochastic optimization techniques. The proposed framework is general and can be applied to both independent and dependent data, such as time series data. Furthermore, the framework is model-agnostic and thus can be used to encode networks and study their changes.
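The network definition above can be made concrete: a correlation network is G = (V, E) whose vertices are the variables and whose edges connect pairs whose absolute sample correlation clears a threshold. A minimal sketch (thresholding is one simple convention for turning a correlation matrix into an edge set):

```python
import numpy as np

def correlation_network(X, threshold=0.5):
    """Build an undirected network G = (V, E) from data X
    (n samples x k variables): vertices are variable indices, and an
    edge (i, j) is present when |corr(X_i, X_j)| exceeds the threshold."""
    R = np.corrcoef(X, rowvar=False)
    k = R.shape[0]
    V = list(range(k))
    E = [(i, j) for i in range(k) for j in range(i + 1, k)
         if abs(R[i, j]) > threshold]
    return V, E

# Hypothetical data: variables 0 and 1 are near-duplicates, 2 is independent.
rng = np.random.default_rng(0)
base = rng.normal(size=1000)
X = np.column_stack([base, base + 0.1 * rng.normal(size=1000),
                     rng.normal(size=1000)])
V, E = correlation_network(X, threshold=0.5)
```

With serially correlated time series data, the sampling distribution of these edge estimates changes, which is the central complication this thesis addresses.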
  • Robust and flexible statistical methods for clinical trials with innovative and practical designs
    (2025-10-02) Bannick, Marlena; Ye, Ting
Randomized clinical trials (RCTs) are widely viewed as the gold standard study design for determining the efficacy of health interventions. Given that RCTs are time-consuming and resource intensive, efficiency is of paramount concern. However, it is crucial to ensure that in the design, conduct, and analysis of clinical trials, expediency does not come at the expense of statistical rigor. In this dissertation, we take a careful look at trial designs that aim to be efficient and practical. The main through line for all of our projects is to propose modifications to the design, conduct, and/or analysis of these trials so that they can still reliably answer their intended questions. We are motivated by real problems encountered in real trials with commonly-used designs. We prioritize solutions to these problems that are easy to implement and understand. We justify our proposals with comprehensive asymptotic theory and practical simulation studies.
  • Advances in Proximal Inference for Continuous Exposures, Estimation of Ill-Posed Regression, and Non-Inferiority Assessment in Active-Controlled Trials
    (2025-10-02) Olivas-Martinez, Antonio; Rotnitzky, Andrea
This dissertation advances proximal inference methods for continuous exposures and ill-posed regression problems, alongside non-inferiority assessment in active-controlled trials. An introduction to the topics is provided in Chapter 1. Chapter 2 develops a unifying framework for non-inferiority evaluation in active-controlled trials, where placebo arms are unavailable, enabling systematic comparison of existing methods in terms of type I error, power, and robustness to transportability misspecifications. Chapter 3, motivated by the analysis of immune correlates of protection in COVID-19 vaccine trials, extends the proximal inference framework to identify and estimate the mean outcome under modified treatment policies, enabling causal analysis of continuous exposures in the presence of unmeasured confounding. Chapter 4 addresses the challenge of estimating ill-posed nuisance functions in proximal inference and related settings, developing a novel finite-sample analysis of kernel-based regularized adversarial stabilized estimators and establishing conditions under which debiased, influence-function-based one-step estimators for a broad class of estimands achieve root-n consistency and asymptotic normality. Together, these contributions support more robust inference in complex causal problems with unmeasured confounding or design constraints.
  • Large-scale snRNA-seq meta-analysis of microglia role in Alzheimer’s disease across statistical methods
(2025-08-01) Zhang, Wenjing Tati; Lin, Kevin
Microglia orchestrate complex neurodegeneration processes that drive Alzheimer’s disease (AD), yet their transcriptional signatures remain inconsistently reported across single-nucleus RNA-seq studies. We analyze three prefrontal-cortex cohorts (Prater, SEA-AD, ROSMAP; ranging from 22 to 345 donors) with five differential-expression (DE) pipelines and introduce Was2CoDE – a Wasserstein-2–based test that partitions donor-to-donor differences into mean, variance and shape components. Three principal findings emerge. First, study design matters: the rigorously curated SEA-AD cohort reproducibly recovers the highest fraction of literature-validated microglia pathways, and power scales chiefly with the number of donors, not nuclei or read depth. Second, among DE frameworks, the matrix-factorization approach eSVD-DE delivers the most consistent gene- and pathway-level signals across independent datasets. Third, we shed light on the opportunity to discover underlying microglia mechanisms by analyzing differential distributions, which is broader than differential mean expression. Specifically, Was2CoDE uncovers distributional shifts missed by mean-centric tests, revealing variance-driven dysregulation in immune and cell-motility programs and highlighting genes such as ARHGEF3, CD9, and SASH1 that escape standard DE thresholds. Together, these results provide quantitative guidance for cohort design, benchmark analytic robustness and supply an open-source tool for full-distribution inference. By integrating method, design and distributional insights, our framework advances the search for microglia therapeutic targets in AD.
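The mean/variance/shape partition has a classical one-dimensional form: with equal sample sizes, the squared 2-Wasserstein distance is the mean squared difference of matched order statistics, and it splits exactly into a mean term, a spread term, and a nonnegative shape remainder. A minimal sketch of that textbook decomposition (not the Was2CoDE test itself):

```python
import numpy as np

def w2_decomposition(x, y):
    """Squared 2-Wasserstein distance between two equal-size samples,
    split into mean, spread, and shape components:
    W2^2 = (mu_x - mu_y)^2 + (sd_x - sd_y)^2 + shape remainder,
    where the remainder is nonnegative and vanishes for same-shape
    distributions."""
    xs, ys = np.sort(x), np.sort(y)
    w2_sq = np.mean((xs - ys) ** 2)        # W2^2 via matched quantiles
    mean_part = (x.mean() - y.mean()) ** 2
    sd_part = (x.std() - y.std()) ** 2
    shape_part = w2_sq - mean_part - sd_part
    return w2_sq, mean_part, sd_part, shape_part

# Two hypothetical donor expression distributions differing only in mean:
rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, 4000)
y = rng.normal(1.0, 1.0, 4000)
w2_sq, mean_part, sd_part, shape_part = w2_decomposition(x, y)
```

A mean-centric DE test sees only the first component; the point of a full-distribution test is that the spread and shape components can carry signal even when the mean term is null.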
  • DAESC-GPU: A GPU-powered Scalable Software for Single-cell Allele-Specific Expression Analysis
    (2025-08-01) Cui, Tengfei; Qi, Guanghao
    Allele-specific expression (ASE) is a powerful signal to study cis-regulatory effects. We previously developed DAESC, a statistical method for single-cell differential ASE analysis across multiple individuals. Despite improved power, the lack of computational efficiency limits its utility on large-scale datasets. Here, we present DAESC-GPU, an accelerated version of DAESC powered by Graphics Processing Units (GPUs). DAESC-GPU is dozens of times faster than DAESC and scalable to datasets of over a million cells. Application of the software on single-cell ASE data from the OneK1K cohort identified novel genes with regulatory patterns specific to naïve and central memory CD4+ T cells.
  • The Instrumental Variable Model with Categorical Instrument, Exposure, and Outcome: Characterization, Partial Identification, and Statistical Inference
    (2025-08-01) Song, Yilin; Richardson, Thomas; Chan, Gary
    Instrumental variable (IV) analysis is a crucial tool in estimating causal relationships that addresses the issue of confounding variables that may lead to bias. Under certain IV assumptions, the causal effect may be partially identified. The binary IV model has been well studied in economics, statistics, and epidemiology, while IV models for general categorical exposure and outcome are less explored. This dissertation studies several aspects of the instrumental variable model with categorical instrument, exposure, and outcome including giving a characterization of the model (Chapters 2, 3 and 5), methods for statistical inference (Chapter 4), and a study of the variation independence properties of the marginal counterfactual distributions (Chapter 6). In Chapter 2, we first give a simple closed-form characterization of the set of joint distributions of the potential outcomes compatible with a given observed probability distribution via a set of inequalities. In Chapter 3, we further derive conditions for the inequalities in Chapter 2 to be non-redundant and construct the minimal set. To handle sampling variability, we provide an algorithm in Chapter 4 to construct confidence regions for any convex functional of the joint counterfactual distribution, such as the average causal effect (ATE), using a finite-sample tail bound for the KL-divergence due to Guo and Richardson [2021]. We also illustrate our methods in Chapters 2 and 4 using data from the Minneapolis Domestic Violence Experiment. In Chapter 5, we study falsification tests for the categorical IV model through simulations. We explore the variation dependence property of the marginal counterfactual distributions and discuss its practical implications in Chapter 6. We conclude with a discussion and directions for future work in Chapter 7.
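For the well-studied binary case this dissertation generalizes, the partial-identification idea can be made concrete: the observed distribution linearly constrains a distribution over the sixteen Balke–Pearl response types, and sharp ATE bounds come from a pair of linear programs. A sketch of that classical construction (not the categorical characterization or the KL-divergence confidence regions developed here):

```python
import numpy as np
from scipy.optimize import linprog

# Response types for binary Z, X, Y: a compliance type c determines
# X as a function of Z, and an outcome type r determines Y as a
# function of X. The unknown is the joint distribution q over the
# 16 (c, r) pairs.
def x_of(z, c):
    return (0, z, 1 - z, 1)[c]   # never-taker, complier, defier, always-taker

def y_of(x, r):
    return (0, x, 1 - x, 1)[r]   # never, helped, hurt, always

def ate_bounds(p):
    """p[z][x][y] = P(X=x, Y=y | Z=z). Returns sharp (lower, upper)
    bounds on the average treatment effect via two linear programs."""
    types = [(c, r) for c in range(4) for r in range(4)]
    A_eq, b_eq = [], []
    for z in (0, 1):
        for x in (0, 1):
            for y in (0, 1):
                A_eq.append([float(x_of(z, c) == x and y_of(x, r) == y)
                             for (c, r) in types])
                b_eq.append(p[z][x][y])
    # ATE = P(Y(1)=1) - P(Y(0)=1) = P(r = helped) - P(r = hurt)
    obj = np.array([{1: 1.0, 2: -1.0}.get(r, 0.0) for (_, r) in types])
    lo = linprog(obj, A_eq=A_eq, b_eq=b_eq, bounds=[(0, 1)] * 16, method="highs")
    hi = linprog(-obj, A_eq=A_eq, b_eq=b_eq, bounds=[(0, 1)] * 16, method="highs")
    return lo.fun, -hi.fun
```

With categorical Z, X, and Y the number of response types grows quickly, which is why a closed-form characterization of the compatible distributions (Chapters 2 and 3) is valuable.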
  • Topics in Causal Inference for Individualized Treatment
    (2025-08-01) Galanter, Nina; Luedtke, Alex; Carone, Marco
    Optimal treatment rules, which use patient characteristics to tailor treatment decisions, are a promising way to improve outcomes when there is patient-to-patient variability in treatment effect. While investigators can best evaluate a treatment rule through a randomized trial of the rule against the standard of care, they must first find the best candidate rule or rules using other data. Causal inference methods enable leveraging electronic health records and standard clinical trial data to identify optimal rules and estimate their value. In this dissertation, we advance causal inference methodology for treatment rules. Our primary application is the treatment of major depression. In Chapter 1, we introduce individual treatment rules and our application of major depression treatment. In Chapter 2, we provide a method that uses summary statistics widely available in published clinical trial results to bound the benefit of optimally assigning treatment to each patient. In Chapter 3, we propose a method to estimate the value of treatment rules that optimize a primary outcome under the constraint that a risk outcome is below a specified threshold. In Chapter 4, we provide our constrained treatment rule value estimator with two alternative confidence intervals, a bootstrap confidence interval with improved finite sample performance, and an analytical confidence interval with improved theoretical guarantees.
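A simple plug-in version of the unconstrained problem illustrates what a treatment rule and its value are: fit an outcome model in each arm of a randomized trial, treat whenever the estimated conditional effect is positive, and evaluate the resulting rule. A hedged sketch with simulated data (not the bounds of Chapter 2 or the constrained-value estimator of Chapters 3 and 4):

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated trial: covariate x, randomized treatment a, and an outcome
# whose true conditional average treatment effect is 2x - 1.
n = 20000
x = rng.uniform(0, 1, n)
a = rng.integers(0, 2, n)
y = (2 * x - 1) * a + rng.normal(0, 1, n)

def fit_linear(xv, yv):
    """Least-squares fit of y = b0 + b1 * x; returns a prediction function."""
    X = np.column_stack([np.ones_like(xv), xv])
    coef, *_ = np.linalg.lstsq(X, yv, rcond=None)
    return lambda t: coef[0] + coef[1] * t

# Plug-in rule: treat when the estimated effect mu1(x) - mu0(x) > 0.
mu0 = fit_linear(x[a == 0], y[a == 0])
mu1 = fit_linear(x[a == 1], y[a == 1])
rule = (mu1(x) - mu0(x) > 0).astype(int)

def true_value(d):
    """Evaluate a rule against the known simulation truth (CATE = 2x - 1)."""
    return np.mean((2 * x - 1) * d)

v_rule = true_value(rule)
v_all = true_value(np.ones(n))
v_none = true_value(np.zeros(n))
```

Here tailoring beats both treat-everyone and treat-no-one because the effect changes sign across patients; the dissertation's contribution is doing this estimation and inference rigorously, and under risk constraints.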
  • Estimating HIV Cross-sectional Incidence Using Recency Tests from a Non-representative Sample
    (2025-08-01) Pan, Jianan; Gao, Fei
Cross-sectional incidence estimation based on recency testing is an important tool in HIV research. This method has been used to estimate “placebo” incidence in active-control HIV prevention trials by applying the cross-sectional estimator to data from the screening population. The application of this approach faces challenges due to non-representative sampling, as individuals aware of their HIV-positive status may be less likely to participate in screening for an HIV prevention trial. To address this, a recent phase 3 trial introduced a test-based exclusion criterion: individuals were excluded during trial screening if they had recently taken an HIV test. To the best of our knowledge, the theoretical and empirical validity of applying a test-based exclusion criterion has yet to be studied. We develop a statistical framework that incorporates non-representative sampling and a testing-based exclusion criterion. We introduce a metric called the effective mean duration of recent infection that mathematically quantifies bias in the recency-based estimate of incidence. We investigate the performance of cross-sectional HIV incidence estimation in settings emulating current trial designs in an extensive simulation study. We find that when HIV-negative individuals disproportionately attend screening for prevention trials, the traditional incidence estimator is unreliable unless all individuals with recent HIV tests are excluded from the sample. Additionally, we highlight a trade-off between bias and variability: excluding more individuals reduces bias from non-representative sampling but in many cases increases the variability of incidence estimates (even for a fixed sample size). Our findings emphasize the need for caution when applying the testing-based exclusion criterion and the importance of refining incidence estimation methods to improve the design and analysis of future HIV prevention trials.
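For context, the traditional adjusted cross-sectional estimator takes roughly the following form (the Kassanjee-style estimator; all counts below are hypothetical, and this sketch omits the effective-MDRI correction developed in this work):

```python
def cross_sectional_incidence(n_neg, n_pos, n_recent, mdri_yrs, frr, big_t_yrs=2.0):
    """Adjusted cross-sectional incidence estimator:
    lambda = (N_recent - FRR * N_pos) / (N_neg * (MDRI - FRR * T)),
    where MDRI is the mean duration of recent infection, FRR the
    false-recent rate, and T the recency cutoff, all in years."""
    return (n_recent - frr * n_pos) / (n_neg * (mdri_yrs - frr * big_t_yrs))

# Hypothetical screening sample: 9000 HIV-negative and 1000 HIV-positive
# participants, 30 of whom test "recent"; MDRI = 0.5 years, FRR = 1%.
inc = cross_sectional_incidence(9000, 1000, 30, 0.5, 0.01)  # ~0.46% per year
```

The dissertation's point is that when screening attendance depends on HIV status, the MDRI entering this formula is no longer the laboratory-calibrated one, so the estimate is biased unless the sampling mechanism is accounted for.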
  • Applying Compositional Data Analysis Methods To Complete Blood-Counts Data For Early COVID-19 Detection
    (2025-08-01) Zhang, Zhilong; Graffelman, Jan
The investigation of using complete blood-count (CBC) data analysis for COVID-19 infection diagnosis has been a topic of interest in the last couple of years. It could be used as an affordable complementary tool to RT-qPCR and is particularly useful for developing areas struggling to test their population or suffering from widespread COVID-19 infection. However, previous research using CBC data for COVID-19 infection classification did not account for the compositional nature of white blood cell counts data. In this master’s thesis, we treat white blood cell counts data as compositional variables and apply compositional data visualization methods, using biplots based on log-ratio principal component analysis. Also, we apply compositional classification models to detect COVID-19 infection, using log-ratio linear discriminant analysis. We successfully illustrate the efficacy of compositional methods by building a compositional classification model superior to traditional models and highlight the benefits of analyzing CBC data from a compositional perspective. In a database of symptomatic individuals, we achieve a classification rate of 85% for the PCR-test result using the main CBC composition with some additional blood characteristics.
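The log-ratio idea underlying both the biplots and the classifier can be sketched with the centered log-ratio (clr) transform: each composition is mapped to logs relative to its geometric mean, making the analysis invariant to the total count. A minimal sketch (hypothetical white-blood-cell percentages; the thesis itself works with log-ratio PCA and LDA):

```python
import numpy as np

def clr(counts):
    """Centered log-ratio transform: map each row of positive parts to
    log(part / geometric mean of the row). Log-ratio PCA and LDA operate
    on these coordinates rather than on the raw counts."""
    x = np.asarray(counts, float)
    logx = np.log(x)
    return logx - logx.mean(axis=1, keepdims=True)

# Hypothetical differential: neutrophils/lymphocytes/monocytes/eos/baso (%).
wbc = np.array([[60.0, 30.0, 6.0, 3.0, 1.0]])
coords = clr(wbc)
```

A key property is scale invariance: multiplying a row by any constant (e.g., switching between counts and percentages) leaves the clr coordinates unchanged, which is exactly why compositional methods ignore the irrelevant total.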
  • Modeling the Effects of Personal Characteristics on Longitudinal Cognitive Performance and Its Variability: An application to the Multicultural Healthy Diet Study
    (2025-08-01) Bai, Lu; Shaw, Pamela A
Advances in digital data collection technologies have greatly improved researchers' ability to collect intensive longitudinal cognition data in real-world environments. Ecological momentary assessment (EMA), one such enhancement, involves repeated measures of ultra-brief cognitive tasks that obtain ecologically valid data via smartphones, thus allowing for less measurement error than traditional lab-based assessments. More sophisticated analytical approaches are required to disentangle the complex structure of this data. This thesis considers EMA cognitive data from a culturally diverse sample in the Multicultural Healthy Diet (MHD) study and provides an exploratory analysis using a multilevel mixed-effects model to better understand their associations with demographic and cultural factors, modeling both the mean and variability in cognitive performance. We start by providing the theoretical background of linear mixed-effects models as a powerful tool and a flexible multilevel modeling framework for longitudinal data analysis in Chapter 2. Chapter 3 presents a preliminary analysis on day-level averages to inform any potential learning effects and assess covariate influences across days. In Chapter 4, we extend our exploration of covariate effects to session-level data with an emphasis on within- and between-person variance in relation to covariates. We conclude this thesis in Chapter 5 by summarizing these findings and discussing potential future work. Overall, this thesis offers a deeper insight into within-day and day-to-day fluctuations in cognitive performance among people from different cultural backgrounds.
  • Performance of weakly-supervised electronic health record-based phenotyping methods in rare-outcome settings
    (2025-08-01) Hong, Yunjing; Williamson, Brian D
Background: Electronic Health Records (EHRs) enable large-scale biomedical research but key outcomes are often imperfectly captured, which is particularly important for rare outcomes in vaccine safety surveillance. Including features derived from natural language processing of chart notes in the EHR may improve computational phenotyping (prediction) performance. Performing manual chart review, the gold standard method for phenotyping, is expensive, limiting the amount of information for traditional supervised prediction algorithms. Methods: We evaluated three weakly-supervised phenotyping algorithms—PheNorm, MAP, and SureLDA—across simulated scenarios varying by disease prevalence (5% vs. 40%), label quality, and data complexity. Performance was measured using discrimination, precision, and calibration metrics across 2,500 replicates per scenario. We also applied probability-guided chart review to 1,028 potential anaphylaxis cases in a proof-of-concept study to see if reasonable cohorts for model development and evaluation could be obtained using this method. Results: Algorithm performance varied by context. Under optimal conditions, PheNorm and MAP achieved AUC > 0.99. In complex, low-prevalence settings, SureLDA variants outperformed others (AUC = 0.95 vs. 0.86 for PheNorm, 0.84 for MAP). Chart selection using predicted probabilities enriched for clinically meaningful cases compared to a random sample stratified on two covariates. Conclusions: Algorithm choice should reflect deployment conditions. SureLDA is robust in complex settings; PheNorm performs well with reliable documentation. Hybrid approaches improve phenotyping accuracy and efficiency in EHR-based vaccine safety surveillance.
  • Item type: Item ,
    A Comparative Study of Brain Structural and Functional Connectivity: Graph Topology, Individual Fingerprinting, and Predictive Modeling
    (2025-08-01) Li, Yuhong; Lila, Eardi; Shojaie, Ali
    Brain connectivity analyses using neuroimaging data provide insights into the structural and functional organization of the human brain. Several approaches have been proposed for modeling structural and functional connectivity, each with its own strengths and limitations. This thesis compares functional connectivity (FC) estimates derived from correlation and partial correlation, evaluating their graph topology and performance in individual identification and prediction tasks, while also contrasting them with structural connectivity (SC) networks. We begin by estimating FC networks based on marginal and partial correlation, and further explore low-order partial correlation graphs as an intermediate approach. Motivated by studies suggesting that brain structural hubs are closely related to functional activity, we also propose an alternative FC construction method that regresses out the temporal activity of SC hubs. Our analysis shows that SC emphasizes within-hemisphere connections and exhibits small-world properties, while FC consistently reveals strong interhemispheric connections regardless of methodology; other network properties, however, vary with the estimation method. Finally, we assess these networks' ability to capture individual-specific features through subject identification and behavioral prediction tasks. Partial correlation-based FC networks perform well in subject identification when the sample size is large, but their performance deteriorates sharply with smaller samples. Regardless of estimation method, FC networks consistently outperform SC in predicting behavioral variables, and combining the two typically improves predictive accuracy, except for partial correlation-based FC. The sample size and number of scan sessions used in FC estimation also have a non-trivial impact on predictive performance. Our study highlights the methodological implications of FC estimation strategies for brain network analysis.
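    The marginal-versus-partial-correlation distinction at the heart of this comparison can be illustrated with the standard first-order partial correlation formula (a generic sketch with made-up correlation values, not the thesis pipeline, which estimates whole networks from fMRI time series):

```python
from math import sqrt

def partial_corr(r_xy, r_xz, r_yz):
    """First-order partial correlation between x and y controlling for z,
    computed from the three pairwise (marginal) correlations."""
    return (r_xy - r_xz * r_yz) / sqrt((1 - r_xz**2) * (1 - r_yz**2))

# Two regions correlated at 0.6 marginally; if both are driven by a third
# region correlated 0.7 with each, most of that marginal edge disappears.
edge = partial_corr(0.6, 0.7, 0.7)
```

This is why marginal-correlation FC tends to be denser than partial-correlation FC: edges induced by shared drivers (such as hub regions) survive in the former but are attenuated in the latter.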
  • Item type: Item ,
    Applications of Identity-By-Descent Analysis in Population Genetics Research: Methods for Demographic History Inference and Genetic Association Analysis
    (2025-05-12) Cai, Ruoyi; Browning, Sharon R.
    The analysis of identity-by-descent (IBD) segments is a powerful tool in population genetics research and has led to important discoveries about human genetic history and population structure. While previous research has made significant progress in using IBD segments to study population genetics, ongoing methodological advances and the increasing availability of genomic data create new opportunities to exploit IBD information further. In this dissertation, we explore the potential of IBD information in two key areas of genetics research: inferring demographic history and performing genetic association analysis (IBD mapping) for complex traits. First, we present a method to estimate the X chromosome effective population size from X chromosome IBD segments, and we demonstrate how it can be combined with the autosomal effective population size to inform sex-specific demographic history. Second, we introduce an IBD mapping approach for association analysis between genome-wide loci and complex traits, along with a novel multiple testing adjustment strategy that accounts for the correlation structure among test statistics in genome-wide IBD scans. Our research contributes new statistical and computational tools that enhance the use of IBD information in demographic inference and genetic association studies, improving our understanding of human evolutionary history and the genetic architecture of complex traits.
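    The link between IBD segment length and time depth that such demographic inference exploits can be sketched with the standard exponential approximation (a textbook approximation, not the specific estimator developed in this dissertation):

```python
from math import exp

def expected_length_cM(g):
    """Mean length (in cM) of an IBD segment inherited from a common
    ancestor g generations ago: recombination over the 2g intervening
    meioses makes the length approximately exponential with rate 2g
    per Morgan, hence mean 100 / (2g) centimorgans."""
    return 100 / (2 * g)

def prob_longer_than(u_cM, g):
    """Probability such a segment exceeds u centimorgans."""
    return exp(-2 * g * u_cM / 100)
```

Because long segments come almost exclusively from recent ancestors, the length spectrum of observed IBD segments carries information about effective population size in the recent past.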
  • Item type: Item ,
    Methods and Software for Small Area Estimation in Low- and Middle-Income Countries
    (2025-05-12) Wu, Yunhan; Wakefield, Jon
    Demographic and health disparities in low- and middle-income countries (LMICs) persist, yet household surveys, the major data source for many indicators, lack the granularity needed for localized estimates due to data sparsity. This dissertation advances small area estimation (SAE) methods for demographic and health indicators in LMICs, focusing on Bayesian hierarchical modeling to improve precision, account for survey design complexities, and develop novel frameworks for fine-scale subnational estimates of key indicators. In Chapter 2, we incorporate urban/rural stratification into unit-level models to correct biases from urban oversampling. In Chapter 3, we propose ultimate years of schooling, a birth-cohort-based measure that estimates final educational attainment while accounting for ongoing schooling trajectories and right-censoring in survey data. In Chapter 4, we develop a fertility estimation framework that integrates spatial, temporal, and maternal education effects to capture demographic trends at the subnational level. In Chapter 5, we present SurveyPrev RShiny, an interactive application that translates advanced SAE methods into an accessible tool for researchers and policymakers.
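    The precision-weighting idea behind the Bayesian hierarchical smoothing used throughout SAE can be sketched with a simple Fay-Herriot-style area-level shrinkage estimator (a toy illustration with made-up numbers, not the models developed in these chapters):

```python
def shrink(direct, var_direct, prior_mean, var_between):
    """Precision-weighted compromise between a noisy direct survey
    estimate and a model-based prior mean (Fay-Herriot-style shrinkage):
    areas with noisier direct estimates are pulled harder toward the prior."""
    w = var_between / (var_between + var_direct)
    return w * direct + (1 - w) * prior_mean

# Toy numbers: a data-sparse area vs. a well-sampled one, both with a
# direct estimate of 0.40 against a prior mean of 0.25.
sparse = shrink(direct=0.40, var_direct=0.09, prior_mean=0.25, var_between=0.01)
dense = shrink(direct=0.40, var_direct=0.001, prior_mean=0.25, var_between=0.01)
```

The well-sampled area keeps an estimate near its direct value, while the sparse area is shrunk toward the prior mean; this borrowing of strength is what makes subnational estimates feasible where survey data are thin.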