Model-based design of regulatory DNA for cell type-specific gene expression

Yin, Christopher

Files

Yin_washington_0250E_28741.pdf (36.85 MB)

Date

2025-10-02

Abstract

An important and largely unsolved problem in synthetic biology is how to target gene expression to specific cell types. Enhancers are a class of cis-regulatory element (CRE) located in the noncoding regions of the genome and are implicated as major drivers of cell type-specific gene expression, acting via a complex and incompletely understood sequence grammar. Next-generation sequencing has enabled the interrogation of this grammar at high throughput via multiple paradigms. Massively Parallel Reporter Assays (MPRAs) directly measure enhancer activity for libraries of up to hundreds of thousands of sequences at once, but are limited in sequence length and in the range of experimentally compatible targets. Chromatin accessibility is commonly used as a surrogate metric indicating likely enhancer identity, and can be profiled genome-wide, using techniques such as ATAC-seq or DNase-seq, for a far greater range of biological targets than MPRAs can address; nonetheless, these techniques cannot provide a direct readout of enhancer activity. In this dissertation I explore the capacity of deep learning models trained on either MPRA or chromatin accessibility data to design functional cell type-specific enhancers. In Chapter 2, I establish the viability of both approaches by designing and experimentally validating synthetic enhancers targeted to two human cancer cell lines. In Chapter 3, I build upon this work by training models only on accessibility data, which lets me take advantage of the greater coverage of biological diversity in accessibility datasets compared to MPRA compendia. I show successful enhancer design in 9 of 10 human cell lines, confirming the generalizability of this approach, and additionally show in vivo that enhancers targeted to a retinoblastoma line are active in mouse retinas.
In both chapters I analyze the sequence determinants of enhancer specificity via enrichment-based and explainable AI techniques, exposing complex combinatorial relationships between discrete sequence elements corresponding to known transcription factor binding sites (TFBSs). Furthermore, I analyze why enhancers designed to achieve high predicted specific accessibility sometimes fail to exhibit correspondingly specific enhancer activity, an analysis that will inform future expansions of this approach. This work shows that model-guided design of enhancers can help us decipher the cis-regulatory code governing cell type specificity and can yield novel tools for selective targeting of human cell types.