Statistical Methods for Inferring Population Structure with Human Genome Sequence Data

Kirk, Jennifer Lee

Files

Kirk_washington_0250E_16778.pdf (22.09 MB)

Date

2017-05-16

Authors

Kirk, Jennifer Lee

Abstract

Population structure is systematic variation in the human genome due to non-random mating because of physical or cultural barriers. Population structure is of interest in several fields of medicine, including population genetics, medical genetics, and personalized genomics. Advances in sequencing technology have led to a precipitous drop in the cost to sequence the human genome, which has led to a plethora of sequencing studies in recent years. This increase in the availability of genotype data has led to a commensurate increase in the number of statistical methods for analyzing sequence data. To date, the majority of these new methods have focused on association testing, with relatively little work on inferring population structure, despite the importance of population structure inference. There are several challenges to inferring population structure with sequencing data, including an abundance of rare variants (loci where there is little variation across human populations) and a large number of loci. Existing methods are not directly applicable to rare variants, and few computationally feasible methods exist. This dissertation considers the problem of inferring population structure with human genome sequence data. We present new statistical methods, with theoretical justification, extensive simulation studies, and applications to the 1000 Genomes Project data. We also develop extensions of the methods that are computationally feasible for large sequencing data sets and that allow for the use of reference population samples to better elucidate population structure from sequence data.