Missing Data Methods for Observational Health Dataset

Cheng, Gang

Files

Cheng_washington_0250E_24856.pdf (2.89 MB)

Date

2022-09-23

Abstract

This dissertation is motivated by missing data problems arising from two observational health datasets. The first dataset was created by the SWOG study, which linked Medicare claims to a prostate cancer prevention trial dataset. The second is a diabetes EHR dataset that contains longitudinal measurements of diabetes patients over 11 years.

For the first dataset, we are interested in estimating the long-term effect of a treatment. In a time-to-event setting, Medicare claims are linked to clinical trial data to extend the follow-up period for trial participants. This allows estimation of long-term effects that cannot be estimated from clinical trial data alone. However, such data linkages are often incomplete for various reasons. We formulate incomplete linkage as a missing data problem, with careful consideration of the relationship between the linkage status and the missing data mechanism. We propose a conditional linking at random (CLAR) assumption and an inverse probability of linkage weighting (IPLW) partial likelihood estimator. We show that the IPLW partial likelihood estimator is consistent and asymptotically normal. We further extend our approach to incorporate time-dependent covariates and apply it to the SWOG study.

For the second dataset, the longitudinal measurements for diabetes patients are subject to nonmonotone missingness. The conventional ignorability and missing-at-random (MAR) conditions are unlikely to hold for nonmonotone missing data, and data analysis can be very challenging when few complete cases are available. We introduce the available complete-case missing value (ACCMV) assumption for handling nonmonotone, missing-not-at-random (MNAR) problems. The ACCMV assumption is applicable to datasets with a small set of complete observations, and we show that it leads to nonparametric identification of the distribution of the variables of interest. We further propose an inverse probability weighting estimator, a regression adjustment estimator, and a multiply-robust estimator for estimating a parameter of interest. We study the asymptotic and efficiency theory of the proposed estimators and illustrate the applicability of our method by applying it to the diabetes EHR dataset.

Finally, we consider the problem of trajectory recovery. Repeated measurements collected from individuals naturally form a long trajectory, and the length of the trajectory creates additional difficulty for modeling and computation. We introduce a block-Markov type assumption to handle such missing data problems and prove that it leads to nonparametric identification of the joint distribution of the trajectory. Based on this assumption, we are able to decompose trajectories into multiple missing blocks and thus greatly reduce both the computational and modeling complexity. For modeling purposes, we further propose a model-based assumption, which allows us to use both linear models and flexible machine learning models to impute missing values. We again illustrate the applicability of our method by applying it to the diabetes EHR dataset.
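
The abstract names inverse probability weighting as one of the estimation strategies. As a rough illustration of that general idea only, and not of the dissertation's IPLW or ACCMV estimators, the sketch below computes a normalized inverse-probability-weighted mean when some outcomes are missing. The function name `ipw_mean`, the toy data, and the assumption that each observation probability is already known are all hypothetical choices made for this example; in practice the probabilities would have to be estimated under an explicit missing-data assumption.

```python
def ipw_mean(values, observed, prob_observed):
    """Normalized (Hajek-type) inverse-probability-weighted mean.

    values        : outcomes; entries for unobserved units are ignored
    observed      : booleans, True where the outcome was observed
    prob_observed : assumed-known probability that each unit is observed
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for y, is_observed, p in zip(values, observed, prob_observed):
        if not is_observed:
            continue  # missing outcomes contribute nothing directly
        if not 0.0 < p <= 1.0:
            raise ValueError("observation probabilities must lie in (0, 1]")
        weight = 1.0 / p  # rarely observed units stand in for similar missing ones
        weighted_sum += weight * y
        weight_total += weight
    if weight_total == 0.0:
        raise ValueError("no observed outcomes to average")
    return weighted_sum / weight_total


# Toy data (illustrative only): the last unit is missing, and units like it
# are assumed to be observed with probability 0.5.
values = [1.0, 1.0, 3.0, float("nan")]
observed = [True, True, True, False]
prob_observed = [1.0, 1.0, 0.5, 0.5]

print(ipw_mean(values, observed, prob_observed))  # 2.0
```

In this toy example the complete-case average of the three observed values is 5/3, while the weighted estimate is 2.0, because the observed unit with probability 0.5 is counted twice to represent the unit that was not observed.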