Understanding human genome regulation through entropic graph-based regularization and submodular optimization
Libbrecht, Maxwell Wing
I am interested in developing computational methods to improve understanding of human genome regulation. This thesis is organized around two novel machine learning methods. First, I present a new method in the field of posterior regularization. The genomic neighborhood of a gene influences its activity, a behavior that is attributable in part to domain-scale regulation, in which regions of hundreds or thousands of kilobases known as domains are regulated as a unit. I developed a method called entropic graph-based posterior regularization that makes it possible to jointly model all available genomic data sets, including chromatin state information from ChIP-seq and chromatin conformation information from 3C-based assays. Using this approach, I produced a comprehensive model of chromatin domains in eight human cell types, thereby revealing the relationships among known domain types and identifying a new category of domain that I term "specific expression domains." Second, I present new methods from the field of submodular optimization. Due to the high cost of sequencing-based genomics assays such as ChIP-seq and DNase-seq, a panel of only 3-10 assays is usually performed on each cell type. I present submodular selection of assays (SSA), a method for choosing a diverse panel of genomic assays that leverages methods from the field of submodular optimization. SSA performs better than alternative strategies in practice, is computationally efficient and extremely flexible, and is theoretically optimal under certain assumptions. This application may also serve as a model for how submodular optimization can be applied to other discrete problems in biology. I applied a similar technique to remove redundancy in protein sequence data sets. This method applies submodular optimization to choose representative sets of protein sequences, achieving better results in terms of both reduction in redundancy and functional diversity.
These methods are widely applicable to computational problems in genome biology and provide opportunities for the further development of methods.
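To illustrate the general idea behind choosing a diverse panel by submodular optimization, the following minimal sketch greedily maximizes a facility location function over a toy similarity matrix. This is an assumption-laden illustration, not the SSA implementation: the abstract does not state which objective is used, so the facility location objective, the function names `facility_location` and `greedy_select`, and the similarity values are all hypothetical.

```python
# Illustrative sketch only: the objective, function names, and data are
# assumptions and are not taken from the thesis.

def facility_location(sim, selected):
    """Sum, over all items, of the similarity to the closest selected item."""
    if not selected:
        return 0.0
    return sum(max(row[j] for j in selected) for row in sim)


def greedy_select(sim, k):
    """Pick up to k items, each time adding the one with the largest marginal gain."""
    selected = []
    remaining = set(range(len(sim)))
    for _ in range(min(k, len(sim))):
        current = facility_location(sim, selected)
        # Iterate candidates in sorted order so ties resolve deterministically.
        best = max(
            sorted(remaining),
            key=lambda j: facility_location(sim, selected + [j]) - current,
        )
        selected.append(best)
        remaining.remove(best)
    return selected


# Hypothetical pairwise similarities among five assays:
# assays 0-1 form one redundant group, assays 2-4 form another.
sim = [
    [1.0, 0.9, 0.1, 0.1, 0.1],
    [0.9, 1.0, 0.1, 0.1, 0.1],
    [0.1, 0.1, 1.0, 0.8, 0.5],
    [0.1, 0.1, 0.8, 1.0, 0.8],
    [0.1, 0.1, 0.5, 0.8, 1.0],
]

panel = greedy_select(sim, 2)
print(panel)  # one assay from each group
```

With a budget of two, the greedy procedure first takes the assay most representative of the larger group and then an assay from the other group, rather than two near-duplicates. For monotone submodular objectives such as this one, greedy selection under a cardinality constraint is known to achieve at least a (1 - 1/e) fraction of the optimal value.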