Challenges Associated with Statistical Analysis in the Presence of Sparse Data and Applications to Alternative Tobacco Product Research
Sparsity, meaning rarely observed covariate combinations, is a recurring problem in research on the health risks of alternative-use (non-combusted) tobacco products (AUPs). Of particular concern is sparsity among AUP users who neither currently use nor formerly used other tobacco products. This thesis aims to identify why sparsity is a concern, the effect it can have on statistical inference, and appropriate analytic approaches when it is present. Special attention is paid to scenarios in which sparsity leads to estimates of the AUP effect that are in the opposite direction of the true effect (e.g. found to be harmful when truly beneficial) or in the opposite direction relative to the cigarette effect (e.g. found to be less harmful than cigarettes when truly more harmful). The impact of sparsity is assessed primarily by constructing examples from both case-control and cohort studies and comparing the results of common statistical modeling methods under sparse and non-sparse conditions. These include hypothetical examples constructed to approximate real-world study designs as well as data from a published study of an AUP. The examples focus on sparsity in relation to interaction assumptions and model scale assumptions. Conditional parameter estimates can vary widely from the marginal estimates for the same parameter. Data sets with few subjects who use AUPs without also using cigarettes have reduced power to detect interaction. When scale or interaction assumptions are violated, estimates of incidence rates or parameter values can be biased. This bias can be large enough that conclusions drawn from the analysis of sparse data sets are misleading. These issues can cause AUP use to be estimated as beneficial when it is in truth harmful, or as less harmful than cigarettes when it is in truth more harmful.
These issues are severe enough that, when it is not possible to oversample the sparse categories, we recommend restricting analysis to subgroups in which sparsity is unlikely to be a concern.
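As a rough illustration of the kind of reversal described above, the following sketch pools two smoking strata under a no-interaction (common odds ratio) assumption. All counts are hypothetical and are not taken from the thesis or from any study; the Mantel-Haenszel estimator is used here only as a simple stand-in for the "common statistical modeling methods" the abstract mentions, which it does not name. In this invented cohort, AUP use is harmful among never-smokers, but only 20 never-smokers use AUPs, so the pooled estimate is driven almost entirely by the smoker stratum.

```python
# Hypothetical expected counts (NOT from the thesis): (cases, non-cases)
# for each smoking stratum, split by AUP exposure.
strata = {
    # Sparse cell: only 20 AUP users who never smoked; risk 0.15 vs 0.10.
    "never_smoker": {"aup": (3, 17), "no_aup": (500, 4500)},
    # Well-populated cell: risk 0.25 among dual users vs 0.30 among smokers.
    "smoker": {"aup": (500, 1500), "no_aup": (900, 2100)},
}


def odds_ratio(exposed, unexposed):
    """Odds ratio from (cases, non-cases) pairs for exposed and unexposed."""
    a, b = exposed
    c, d = unexposed
    return (a * d) / (b * c)


def mantel_haenszel_or(tables):
    """Pooled odds ratio across strata, assuming a common odds ratio."""
    num = 0.0
    den = 0.0
    for t in tables.values():
        a, b = t["aup"]
        c, d = t["no_aup"]
        n = a + b + c + d
        num += a * d / n
        den += b * c / n
    return num / den


stratum_or = {name: odds_ratio(t["aup"], t["no_aup"]) for name, t in strata.items()}
pooled_or = mantel_haenszel_or(strata)

for name, value in stratum_or.items():
    print(f"{name}: OR = {value:.2f}")
print(f"pooled (Mantel-Haenszel): OR = {pooled_or:.2f}")
```

With these invented counts, the never-smoker odds ratio is above 1 while the pooled odds ratio is below 1, so an analysis that assumes no interaction would report AUP use as protective even though it is harmful in the subgroup of primary interest.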
- Biostatistics