Statistical methods for genomic sequencing data

Min, Alan Ted

Files

Min_washington_0250E_26057.pdf (20.42 MB)

Date

2023-09-27

Abstract

Genomic sequencing data has revolutionized our understanding of the genetic basis of biological processes. The cost of sequencing the first human genome was estimated to be greater than 50 million dollars. With the advent of next-generation sequencing, however, that cost has decreased to a few hundred dollars. It is thus now possible to use sequencing technology to understand nuanced aspects of the cell at both the population and single-cell levels. In this dissertation, we present three projects that develop statistical methods for analyzing genomic data. In the first project, we discuss how heritability estimators based on single nucleotide polymorphisms are affected under alternative structures of linkage disequilibrium. We demonstrate that linkage disequilibrium has the potential to bias modern estimators of heritability. In the second project, we investigate a sequencing-based assay that measures local chromatin structure. In this context, we propose a prior that allows a latent Dirichlet allocation model of chromatin accessibility data to leverage auxiliary data. In the third project, we consider the connection between sequence data and epigenomic or expression data in the context of multitask learning models. We demonstrate that this multitask learning setup can lead to inaccurate models when genomic features that are irrelevant for one task are erroneously assigned significance in a related task.