Conditional Generation of Protein Sequence and Structure
Authors
Lisanza, Sidney
Abstract
The advent of atomic-accuracy protein structure prediction from sequence with deep learning networks, spurred by AlphaFold, has had a remarkable impact on the field of biochemistry. It has resulted in rapid progress in protein design because it allows structural hypotheses to be interrogated quickly, without the need to acquire experimental data, which is expensive and time-consuming. However, how to properly use these deep learning models to generate protein sequences and structures with user-defined functional and biochemical properties remains elusive. This was the focus of my dissertation work. I first addressed this question by taking pretrained structure prediction networks, namely RoseTTAFold, and applying techniques from image processing to make them generative, in a method termed “constrained hallucination”. I applied the technique to optimize sequences such that their predicted structures contain desired functional sites, on a range of design problems including epitope scaffolding, metal binding, and protein binding. Experimental characterization of these designs demonstrates that they have the desired activities. In follow-up work, I improved upon joint sequence-structure generation by employing the denoising diffusion probabilistic model framework popularized in image generation. I developed ProteinGenerator, a sequence-space diffusion model based on RoseTTAFold that simultaneously generates protein sequences and structures. Beginning from random amino acid sequences, the model generates sequence and structure pairs by iterative denoising, guided by any desired sequence and structural protein attributes. To explore the versatility of this approach, I designed and tested proteins enriched for specific amino acids, with internal sequence repeats, with masked bioactive peptides, with state-dependent structures, and with key sequence features of specific protein families.
Finally, looking to the future: for particularly difficult protein design problems, such as the design of highly active enzymes, experimental feedback is necessary to improve functionality with minimal design iterations. Active learning (AL) and Bayesian optimization (BO) approaches provide a principled way to incorporate experimental feedback into the design process and thereby minimize the number of cycles between computation and experimental testing needed to optimize the desired function. However, these approaches do not incorporate strong generative priors to bias exploration and exploitation toward valid regions of protein space. Therefore, to improve upon current BO and AL methods, I hypothesize that coupling a joint sequence and structure diffusion model with Bayesian optimization will allow a more efficient search of the sequence-activity landscape for highly active variants. To this end, I developed a joint sequence and structure denoising generative model, ProteinGenerator2 (PG2), whose generation I bias with zero-shot predictors to yield diverse pools of sequences predicted to be highly active for testing.
Description
Thesis (Ph.D.)--University of Washington, 2023
