Latent Modeling of the Human Epigenome

Schreiber, Jacob

Files

Schreiber_washington_0250E_21271.pdf (23.16 MB)

CertificateOfCompletion.pdf (119.43 KB)

Date

2020-04-30

Abstract

The human epigenome has been experimentally characterized by assays of methylation, histone modification, chromatin accessibility, and protein binding in hundreds of cell lines and tissues (“biosamples”). The result is a huge compendium of data, consisting of thousands of measurements for every basepair in the human genome. These measurements are immensely valuable, in large part because they measure forms of biological activity that differ across biosamples and can help explain many biosample-specific cellular mechanisms that cannot be explained by nucleotide sequence alone, particularly those driving development and disease. However, these data present two major challenges. The first challenge is that, due primarily to cost, the total number of assays that can be performed is limited. The second challenge is that, despite being incomplete, these compendia are already so large that they can be difficult for either humans or computational methods to make sense of. In this thesis, we address both of these challenges with a deep tensor factorization method, Avocado, that is trained to impute genome-wide epigenomics experiments. Avocado solves the first challenge by completing the compendium via imputation of all epigenomic experiments that have not yet been performed. Avocado solves the second challenge by learning a compression of the entire compendium into a dense, information-rich, latent representation. We first applied Avocado to a compendium of data produced by the Roadmap Epigenomics Consortium that contained measurements of chromatin accessibility and histone modification. Our results confirmed the strength of the Avocado model: first, we found that Avocado can impute epigenomic data more accurately than previous methods, and second, we showed that machine learning models that exploit Avocado’s learned representation outperform those trained directly on epigenomic data on a variety of genomics tasks. 
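The imputation idea described above can be illustrated with a toy example. The sketch below is not Avocado's architecture: it swaps the deep model for a plain CP-style factorization fitted by alternating ridge regression, and every name in it (`impute_by_factorization`, `rank`, `ridge`, `sweeps`, the synthetic tensor) is a hypothetical choice made for illustration. It shows only the general principle that latent factors for biosamples, assays, and genomic positions, learned from the experiments that were performed, can be recombined to predict experiments that were not.

```python
import numpy as np


def impute_by_factorization(tensor, observed, rank=2, ridge=1e-3, sweeps=30, seed=0):
    """Toy imputation of a (biosample x assay x position) tensor.

    `observed[b, a]` is True when the experiment (biosample b, assay a)
    was performed. Unobserved experiments may hold NaN; they are never read.
    This is a plain CP factorization, an assumption standing in for the
    deep model described in the abstract.
    """
    rng = np.random.default_rng(seed)
    n_bio, n_assay, n_pos = tensor.shape
    B = rng.normal(size=(n_bio, rank))    # biosample factors
    A = rng.normal(size=(n_assay, rank))  # assay factors
    P = rng.normal(size=(n_pos, rank))    # genomic position factors
    eye = ridge * np.eye(rank)

    def solve(design, target):
        # Ridge-regularized least squares for one block of factors.
        return np.linalg.solve(design.T @ design + eye, design.T @ target)

    for _ in range(sweeps):
        # Update each biosample factor from the assays observed in it.
        for b in range(n_bio):
            idx = np.flatnonzero(observed[b])
            design = (A[idx][:, None, :] * P[None, :, :]).reshape(-1, rank)
            B[b] = solve(design, tensor[b, idx, :].reshape(-1))
        # Update each assay factor from the biosamples it was run in.
        for a in range(n_assay):
            idx = np.flatnonzero(observed[:, a])
            design = (B[idx][:, None, :] * P[None, :, :]).reshape(-1, rank)
            A[a] = solve(design, tensor[idx, a, :].reshape(-1))
        # Update all position factors at once from every observed experiment.
        bs, as_ = np.nonzero(observed)
        design = B[bs] * A[as_]
        P = solve(design, tensor[bs, as_, :]).T

    # Recombine the factors to fill in every experiment, observed or not.
    return np.einsum("bk,ak,pk->bap", B, A, P)


# Synthetic low-rank "compendium" with three experiments never performed.
rng = np.random.default_rng(1)
truth = np.einsum(
    "bk,ak,pk->bap",
    rng.normal(size=(6, 2)),
    rng.normal(size=(5, 2)),
    rng.normal(size=(40, 2)),
)
observed = np.ones((6, 5), dtype=bool)
observed[[0, 2, 4], [1, 3, 0]] = False
data = np.where(observed[:, :, None], truth, np.nan)

imputed = impute_by_factorization(data, observed)
held_out_mse = np.mean((imputed[~observed] - truth[~observed]) ** 2)
print("held-out MSE:", held_out_mse)
```

The learned factors `B`, `A`, and `P` play the role of a compact latent representation in this toy setting; a real compendium would additionally require handling of genome scale, signal normalization, and a far more expressive model.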
Next, we applied Avocado to the ENCODE Compendium, which is several times larger than the Roadmap Compendium and additionally includes measurements of protein binding and transcription. We demonstrate that, even in this more difficult setting, Avocado’s imputations are of high quality and that the predictions of protein binding outperform the top models in a recent ENCODE-DREAM challenge. Although the ENCODE Compendium currently contains only a small fraction of potential experiments, the human epigenome remains the best-characterized epigenome of any species. Accordingly, we extended Avocado to leverage the large number of human epigenomic data sets when making imputations in other species. We found that this extension not only yields improved imputations of mouse epigenomics but also makes accurate and biosample-specific imputations for assays that have been performed in humans but not in mice. Further, we found that our extension allows an epigenomic similarity measure to be defined over pairs of regions across species based on Avocado’s learned representations, and that this score can be used to identify regions with high sequence similarity whose functions have diverged. Finally, we sought to demonstrate the utility of these imputations for a challenging question faced by a scientific consortium such as the ENCODE Consortium: “Which experiments should ENCODE perform next?” We demonstrate how to represent this task as an optimization problem carried out using Avocado’s imputations. Compared with previous work that has addressed a similar problem, our approach has the advantage that it can use imputed data to tailor the selected list of experiments based on data collected previously by the consortium. We demonstrate the utility of our proposed method in simulations, and we provide a general software framework, named Kiwano, for prioritizing the order in which genomic and epigenomic experiments should be performed.
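As an illustration of how experiment prioritization can be posed as an optimization problem over imputed data, the sketch below greedily maximizes a facility-location objective on a similarity matrix between imputed experiments. The abstract does not state which objective Kiwano uses, so the facility-location choice, the squared-correlation similarity, and the function name `greedy_facility_location` are all assumptions made here for illustration; this is not Kiwano's API.

```python
import numpy as np


def greedy_facility_location(similarity, k, already_done=()):
    """Pick k experiments that best "cover" all experiments.

    `similarity[i, j]` scores how well experiment i represents experiment j.
    Experiments in `already_done` count toward coverage but are not chosen
    again, which tailors the selection to previously collected data.
    The objective (facility location) is an assumed, illustrative choice.
    """
    n = similarity.shape[0]
    coverage = np.zeros(n)
    for i in already_done:
        coverage = np.maximum(coverage, similarity[i])

    candidates = [i for i in range(n) if i not in set(already_done)]
    chosen = []
    for _ in range(min(k, len(candidates))):
        # Marginal gain of adding candidate c to the current selection.
        def gain(c):
            return np.maximum(coverage, similarity[c]).sum() - coverage.sum()

        best = max(candidates, key=gain)
        chosen.append(best)
        candidates.remove(best)
        coverage = np.maximum(coverage, similarity[best])
    return chosen


# Hypothetical imputed signal tracks: one row per candidate experiment.
rng = np.random.default_rng(0)
imputed_tracks = rng.normal(size=(8, 200))
similarity = np.corrcoef(imputed_tracks) ** 2  # squared Pearson correlation

ranking = greedy_facility_location(similarity, k=3, already_done=[0])
print("suggested order:", ranking)
```

The order in which experiments are added gives a priority ranking: each step selects the experiment whose imputed signal is least redundant with what has already been performed or selected.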
Taken together, the results presented in this thesis provide strong empirical evidence for the utility and robustness of Avocado. In multiple settings, including the novel cross-species setting, the imputations generated by Avocado are of high quality. The learned latent representations encode epigenomic state in a compact manner and even yield a way to identify orthologous regions that have diverged across species. Finally, we have shown that the imputations are informative and biosample-specific enough to help guide future experimental efforts. All of the results from this thesis are publicly available. The imputations can be found on the ENCODE portal (https://www.encodeproject.org). The model files for the first chapter can be found at https://noble.gs.washington.edu/proj/avocado/model/ and the model files for the second chapter can be found at https://noble.gs.washington.edu/proj/mango/models/. The code for Avocado can be found at https://github.com/jmschrei/avocado and has been made available under an Apache v2 license, and the code for Kiwano can be found at https://github.com/jmschrei/kiwano under the MIT license.