Predicting, Engineering and Interpreting Gene Regulatory Sequences and Proteins with Deep Learning

Linder, Johannes Staffan Anders

dc.contributor.advisor	Seelig, Georg
dc.contributor.author	Linder, Johannes Staffan Anders
dc.date.accessioned	2021-08-26T18:08:38Z
dc.date.issued	2021-08-26
dc.date.submitted	2021
dc.description	Thesis (Ph.D.)--University of Washington, 2021
dc.description.abstract	The vast majority of the 3.1 billion base-pairs in the (haploid) human genome do not code for a particular protein, yet mutations in these non-coding regions can have a profound impact on phenotype and be deleterious. The reason is that within these regions - enhancers, promoters, introns and untranslated regions (UTRs) - resides a cis-regulatory code which governs gene expression and is sensitive to disruption. Ongoing efforts of mapping the relationship between genetic variants and disease phenotype are limited by data and the lack of generalizability. Furthermore, engineering de novo gene-regulatory sequences and proteins according to target specifications, which would aid the development of vaccines, medical therapeutics, molecular sensing devices and more, is hampered by the lack of methods that can reliably generate large sets of diverse and optimized candidate designs for high-throughput screening. This dissertation presents an approach combining Massively Parallel Reporter Assays (MPRAs) with Deep Learning to obtain a sequence-predictive model of Alternative Polyadenylation (APA), a regulatory process occurring mainly in the 3' UTR of pre-mRNA. The trained neural network predicts 3'-end cleavage at base-pair resolution and can accurately prioritize human variants. By developing methods to visualize features learned in higher-order network layers, we extract a cis-regulatory APA code that aligns well with established biology. Next, the dissertation presents a family of methods that were developed to design de novo biological sequences based on the response of a differentiable fitness predictor. These methods, which are based on activation maximization, can be used to efficiently generate millions of diverse, optimized sequence designs on the basis of a deep generative model. Finally, we present a feature attribution method for interpreting neural network predictions. The method, which learns input masks that either reconstruct or destroy the prediction, implements a masking operator based on probabilistic sampling that is shown to be particularly well-suited for interpreting biological sequence models. The developed design and interpretation methods are demonstrated on several DNA, RNA and protein function predictors and outperform state-of-the-art methods for multiple target applications.
dc.embargo.lift	2022-08-26T18:08:38Z
dc.embargo.terms	Restrict to UW for 1 year -- then make Open Access
dc.format.mimetype	application/pdf
dc.identifier.other	Linder_washington_0250E_22853.pdf
dc.identifier.uri	http://hdl.handle.net/1773/47424
dc.language.iso	en_US
dc.rights	CC BY-NC-SA
dc.subject	Computer science
dc.subject.other	Computer science and engineering
dc.title	Predicting, Engineering and Interpreting Gene Regulatory Sequences and Proteins with Deep Learning
dc.type	Thesis

Files

Original bundle

Name: Linder_washington_0250E_22853.pdf
Size: 5.62 MB
Format: Adobe Portable Document Format

Collections

Computer science and engineering