Learning Models of Gene Expression from Synthetic DNA Sequences
Over the past decade, new sequencing technologies have enabled the comprehensive cataloging of human genetic variation, but for most DNA sequence variants we do not even understand the impact on molecular phenotype, let alone on human traits and disease. Despite recent advances, the throughput of gene editing technologies is orders of magnitude too low to allow functional testing of all possible variants. As a result, models that accurately predict the impact of variants on gene expression are desperately needed. In classical genetics, genomes are compared, analyzed, and perturbed to learn the function of the underlying DNA sequences. Here I present a complementary approach, in which fully synthetic DNA sequences are studied. This thesis is a collection of three chapters exploring this approach. In the first chapter, noise buffering at the protein level is explored by constructing feed-forward loops with different engineered miRNA targets. In the second chapter, a model of alternative splicing is learned from millions of random DNA sequences. The resulting model outperforms all other models that we investigated on the task of predicting the effects of human exonic variants on alternative splicing, even though our model was never trained on human DNA sequences. In the third chapter, the same approach is extended to model the impact of the 5’ untranslated region (5’ UTR) on translation in yeast. After training on measurements of ~500,000 synthetic 5’ UTRs, the resulting model accurately predicts the effects of both synthetic and native yeast 5’ UTRs on translational efficiency. This thesis is by no means a comprehensive exploration of the gene expression models that can be learned from synthetic DNA sequences, but rather a starting point. The throughput of both DNA sequencing and DNA synthesis has increased at such a rapid pace that we can now make millions of gene expression measurements in parallel. Given that most genomes have only thousands of genes, we should leverage this additional capacity to test synthetic sequences and improve our models of gene expression.
- Electrical engineering