A Z-estimation System for Two-phase Sampling with Applications to Additive Hazards Models and Epidemiologic Studies

Hu, Jie

Files

Hu_washington_0250E_13830.pdf (1.32 MB)

Date

2015-02-24

Authors

Hu, Jie

Abstract

An observational epidemiologic study often follows a large number of participants for the occurrence of disease. If every covariate is measured for every participant, the study can be very expensive. Two-phase sampling reduces costs by oversampling more informative subjects from a large phase I sample into a small phase II subsample; the expensive covariates are measured only for subjects in the subsample. Analyzing this type of data is challenging, particularly for association studies and risk prediction based on a semiparametric model. It requires new theoretical tools, methods, software, and data analysis examples. This dissertation addresses these challenges.

We provide statisticians with a new theory for developing tools for two-phase studies. The theory is general: it is not specific to a particular model or two-phase study design. It can be used for association studies, by estimating regression parameters, or for risk prediction, by estimating the entire model, including both its parametric and nonparametric parts. It encompasses both likelihood-based and non-likelihood-based inference, and it provides correct inference whether or not the model is misspecified. Because the theory covers such a broad problem area, it can also be regarded as a framework that guides a researcher through the model development process for a two-phase study.

Next, we use our theoretical results to develop a semiparametric additive hazards model for general two-phase designs. We are able to obtain a collection of results systematically. These include estimators of the regression parameters, the cumulative baseline hazard, and individual-specific cumulative hazards under random sampling, two-phase sampling, and two-phase sampling that incorporates auxiliary information embedded in the phase I sample, together with the model-based and robust asymptotic variances of these estimators.

Lastly, we apply our analysis tools to an Atherosclerosis Risk in Communities (ARIC) case-cohort study. We are able to use biomarker information to create a new risk prediction function for coronary heart disease, even though this biomarker information is available only for a selected sample. The individual risk profile calculated from this function can help physicians identify patients for preventive therapy who might be missed by traditional risk evaluation tools. We then further improve prediction precision by incorporating additional information on standard risk factors in the main cohort. With these new tools for two-phase designs implemented in our software, researchers can use a new and expensive biomarker for risk prediction at substantially reduced cost.
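
As a rough illustration of the two-phase idea described in the abstract, the following Python sketch simulates a case-cohort style design and compares an unweighted subsample mean with an inverse-probability-weighted one. It is not the estimator developed in the dissertation. The function name `simulate_two_phase`, the simulated data, and all parameter values are assumptions made only for this example, and inverse-probability weighting is used here as a standard, simple way to correct for oversampling.

```python
import random


def simulate_two_phase(n=20000, case_rate=0.05, p_control=0.2, seed=0):
    """Simulate a case-cohort style two-phase sample (hypothetical data).

    Returns (full_mean, naive_mean, ipw_mean) for an expensive covariate x.
    """
    rng = random.Random(seed)

    # Phase I: disease status is known for every cohort member.
    is_case = [rng.random() < case_rate for _ in range(n)]

    # The expensive covariate exists for everyone but would be measured
    # only in phase II; here cases tend to have higher values.
    x = [rng.gauss(1.0 if d else 0.0, 1.0) for d in is_case]

    # Phase II: keep every case, and each control with probability p_control.
    pi = [1.0 if d else p_control for d in is_case]
    selected = [rng.random() < p for p in pi]

    # Full-cohort mean: available in a simulation, not in a real study.
    full_mean = sum(x) / n

    # Unweighted mean over the subsample ignores the oversampling of cases.
    sampled = [xi for xi, r in zip(x, selected) if r]
    naive_mean = sum(sampled) / len(sampled)

    # Inverse-probability weighting: each selected subject is weighted by
    # 1 / pi, the inverse of its phase II selection probability.
    ipw_mean = sum(xi / p for xi, p, r in zip(x, pi, selected) if r) / n

    return full_mean, naive_mean, ipw_mean


if __name__ == "__main__":
    full, naive, ipw = simulate_two_phase()
    print(f"full cohort: {full:.3f}  naive: {naive:.3f}  weighted: {ipw:.3f}")
```

In this simulation the unweighted subsample mean is pulled toward the cases, while the weighted mean stays close to the full-cohort value even though the covariate is used for only a fraction of the cohort.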