The Bayesian Analysis of Data Arising from Complex Sampling Designs
The majority of this thesis concerns the development of Bayesian methods for two-phase studies. A two-phase study is a study design in which limited information (including the outcome) is known for a large "first-phase" study population, and more extensive information is known for a "second-phase" subset of these individuals. The two-phase design we consider is one in which a simple random or case-control sample is taken from a population at phase I and is cross-classified with respect to outcome and confounder variables. At phase II, individuals are sampled within the cells of the cross-classified data, and additional data are collected on exposure variables. Such a design requires specialized methods of analysis to acknowledge the non-random (outcome-dependent) sampling scheme. The benefit of the two-phase design is that large efficiency gains are possible through judicious choice of the phase I confounder variables and the phase II sample sizes. A number of likelihood-based methods have been developed for the analysis of two-phase data, but we describe a Bayesian approach, which has previously been unavailable. The benefits of a Bayesian approach include relaxation of the reliance on asymptotic inference, and the potential to model data with complex dependencies, for example through the introduction of random effects. The proposed approach uses a log-linear model for the disease-exposure-confounder relationship, and specifies a multivariate normal prior distribution on a reduced set of main effect and interaction terms in the log-linear model. We extend the methodology to include random-effects terms in the log-linear model to perform different kinds of smoothing. In particular, we are interested in the use of two-phase studies in a spatial epidemiological context, where one may wish to account for confounding by location through the introduction of spatial random effects.
We assign independent normal priors to the non-spatial random effects, and an intrinsic conditional autoregressive (ICAR) prior to the collection of spatial random effects. Random effects can also be included in the log-linear model to smooth the cell probabilities in large contingency tables, particularly in the case of sparse data. The Bayesian two-phase approach is illustrated using data collected on Wilms tumour in children, and data on infant mortality in North Carolina. In the last part of the thesis, we consider small area estimation in the context of the developing world. There is a distinct lack of accurate, timely, full-coverage civil registration data in the developing world, and as such, vital statistics cannot be obtained from these countries. These data are needed to formulate good public health programs; to develop regional, national, and global policies; and to implement and evaluate public health actions. We describe an integrated data collection and statistical analysis framework for improved mortality monitoring in areas without comprehensive vital records systems. In particular, we propose the use of statistically informed sampling to increase the efficiency of sampling and to ensure that sufficient data are collected on rare populations. To do so, we use existing information from demographic surveillance system sites to construct a mortality model based on village-level characteristics. On the basis of this model, we predict the number of deaths of interest in each village in the study region, and sample proportionately in each village. The sampled deaths are then modeled as a function of known demographic factors and village-level characteristics, and we use spatial smoothing to tune the model to each village and to exploit similarities of risk in neighbouring villages. The method is illustrated using a simulated data set based on a real demographic surveillance system site in Agincourt, South Africa.
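As an illustration of the proportionate sampling step described above, the following minimal Python sketch divides a fixed total sample among villages in proportion to their model-predicted deaths. It is not the procedure used in the thesis: the function name `allocate_sample`, the village labels, the predicted counts, and the largest-remainder rounding rule are all assumptions introduced here for illustration.

```python
from math import floor


def allocate_sample(predicted_deaths, total_sample):
    """Split total_sample across villages in proportion to predicted deaths.

    predicted_deaths: dict mapping a (hypothetical) village label to the
        number of deaths predicted for it by some mortality model.
    total_sample: total number of deaths the study can afford to sample.
    Rounding uses the largest-remainder rule (an assumption of this sketch)
    so that the integer allocations sum exactly to total_sample.
    """
    total_predicted = sum(predicted_deaths.values())
    if total_predicted <= 0:
        raise ValueError("predicted deaths must sum to a positive value")

    # Exact (fractional) share of the sample for each village.
    quotas = {v: total_sample * d / total_predicted
              for v, d in predicted_deaths.items()}

    # Start from the integer part of each share.
    allocation = {v: floor(q) for v, q in quotas.items()}

    # Hand out the remaining units to the largest fractional remainders.
    leftover = total_sample - sum(allocation.values())
    by_remainder = sorted(quotas, key=lambda v: quotas[v] - allocation[v],
                          reverse=True)
    for v in by_remainder[:leftover]:
        allocation[v] += 1
    return allocation


if __name__ == "__main__":
    # Illustrative predicted counts only; not data from the thesis.
    predicted = {"village_A": 10.0, "village_B": 30.0, "village_C": 60.0}
    print(allocate_sample(predicted, 20))
```

In practice the allocation would also need to respect the number of deaths actually available in each village, which this sketch ignores.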
- Biostatistics