Predicting, Engineering and Interpreting Gene Regulatory Sequences and Proteins with Deep Learning

dc.contributor.advisor: Seelig, Georg
dc.contributor.author: Linder, Johannes Staffan Anders
dc.date.accessioned: 2021-08-26T18:08:38Z
dc.date.issued: 2021-08-26
dc.date.submitted: 2021
dc.description: Thesis (Ph.D.)--University of Washington, 2021
dc.description.abstract: The vast majority of the 3.1 billion base-pairs in the (haploid) human genome do not code for protein, yet mutations in these non-coding regions can be deleterious and have a profound impact on phenotype. The reason is that within these regions - enhancers, promoters, introns and untranslated regions (UTRs) - resides a cis-regulatory code which governs gene expression and is sensitive to disruption. Ongoing efforts to map the relationship between genetic variants and disease phenotype are limited by the available data and by a lack of generalizability. Furthermore, engineering de novo gene-regulatory sequences and proteins according to target specifications, which would aid the development of vaccines, medical therapeutics, molecular sensing devices and more, is hampered by the lack of methods that can reliably generate large sets of diverse and optimized candidate designs for high-throughput screening. This dissertation presents an approach combining Massively Parallel Reporter Assays (MPRAs) with Deep Learning to obtain a sequence-predictive model of Alternative Polyadenylation (APA), a regulatory process occurring mainly in the 3' UTR of pre-mRNA. The trained neural network predicts 3'-end cleavage at base-pair resolution and can accurately prioritize human variants. By developing methods to visualize features learned in higher-order network layers, we extract a cis-regulatory APA code that aligns well with established biology. Next, the dissertation presents a family of methods developed to design de novo biological sequences based on the response of a differentiable fitness predictor. These methods, which are based on activation maximization, can be used to efficiently generate millions of diverse, optimized sequence designs on the basis of a deep generative model. Finally, we present a feature attribution method for interpreting neural network predictions. The method, which learns input masks that either reconstruct or destroy the prediction, implements a masking operator based on probabilistic sampling that is shown to be particularly well-suited for interpreting biological sequence models. The developed design and interpretation methods are demonstrated on several DNA, RNA and protein function predictors and outperform state-of-the-art methods for multiple target applications.
dc.embargo.lift: 2022-08-26T18:08:38Z
dc.embargo.terms: Restrict to UW for 1 year -- then make Open Access
dc.format.mimetype: application/pdf
dc.identifier.other: Linder_washington_0250E_22853.pdf
dc.identifier.uri: http://hdl.handle.net/1773/47424
dc.language.iso: en_US
dc.rights: CC BY-NC-SA
dc.subject: Computer science
dc.subject.other: Computer science and engineering
dc.title: Predicting, Engineering and Interpreting Gene Regulatory Sequences and Proteins with Deep Learning
dc.type: Thesis

Files

Original bundle

Name: Linder_washington_0250E_22853.pdf
Size: 5.62 MB
Format: Adobe Portable Document Format