©Copyright 2021 Sanjay R Srivatsan Multiplex single-cell RNA sequencing for chemical genomics and spatial transcriptomics Sanjay R Srivatsan A dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy University of Washington 2021 Reading Committee: Cole Trapnell, Chair Jay Shendure Stephen Tapscott Program Authorized to Offer Degree: Department of Genome Sciences University of Washington Abstract Multiplex single-cell RNA sequencing for chemical genomics and spatial transcriptomics Sanjay R Srivatsan Chair of the Supervisory Committee: Professor Cole Trapnell Genome Sciences Each of us begins life as a single fertilized cell. Following a seemingly predetermined set of cell divisions, the single cell morphs into a rough mass, then a hollowed tube, and finally becomes a recognizable neonatal form. How the information contained within a single cell si- multaneously specifies an organism’s anatomy, the construction of its organs, and the ability to cogitate on this very question, remains one of biology’s open questions. Although centuries of careful experiments devoted to characterizing development have revealed many important genes and mechanisms, the results of these experiments span different model organisms, developmental stages, cell populations and measurement modalities. Integrating this knowledge base into coher- ent representation requires a cellular scaffold that charts an organism’s development over the axes of time and space. Preliminary unified representations of developing organisms (e.g. C. Elegans, Zebrafish and Mouse) have been created by large-scale single cell RNA sequencing (scRNA-seq) efforts. These efforts have characterized the set of intermediates through which differentiating cells transit and have profiled the large number of cell types present in a developing organism. Although scRNA-seq data have proven powerful in cataloging cellular states, they lack crucial context: i) the experimental context afforded by the comparison of multiple conditions (e.g. wild-type vs. perturbation) and ii) a cell’s spatial context, a crucial factor driving its behavior. To address these knowledge gaps, over the course of my PhD I have developed two scRNA-seq technologies: 1) sci- Plex, a generalizable strategy to label cell populations and 2) sci-Space, a methodology to record a cell’s spatial position in conjunction with its single cell transcriptome. (1) First I developed the sci-Plex protocol, an inexpensive and efficient method to label single cells through the chemical fixation of unmodified single stranded oligos to nuclei prior to scRNA- seq library preparation. To demonstrate proof-of-concept of the sci-Plex protocol, I performed a high-throughput, high-content drug screen at single cell resolution in 3 cancer cell lines; effectively conducting 4,500 independent scRNA-seq experiments at once. The resulting dataset enabled characterization of a drug’s potency, class, mechanism of action, and the heterogeneity of cellular responses induced upon drug treatment. For example, our scRNA-seq data showed that histone deacetylase inhibitors likely lead to cell death by trapping valuable acetyl molecules on chromatin. (2) Next, I extended the application of the sci-Plex protocol and developed the sci-Space method to capture spatial information from sectioned tissue. The fast and scalable sci-Space method uses patterned oligonucleotide barcodes in a regular array such that each spot contains a unique set of sequences. Then, to mark each nucleus’ coordinates on the grid, the barcodes are stamped onto a tissue section prior to disaggregation and library preparation. To showcase the power of sci-Space, I collected a dataset comprising over 120,000 cells originating from 14 sections of a single E14 mouse embryo. The resulting data uncovers the genes that drive the devel- oping organism’s body plan and reveals a widespread migration signature within neurons that form the developing brain. These data also provide a quantitative assessment of how cell state relates to spatial position within the developing embryo. Specifically, our estimates indicate that 25% of the variance in gene expression observed is attributable to spatial position. It is my hope that this technology will power the generation of a unified scaffold of development akin to the reference genome. I believe that such a unified representation will be instrumental in amassing data, accel- erating discovery and facilitating translation through the training of machine learning models of cellular state. TABLE OF CONTENTS Page List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii Chapter 1: Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.1 The Sequencer as a Molecular Counter . . . . . . . . . . . . . . . . . . . . . . . . 1 1.2 The new quantitative biologist . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 1.3 The right level of abstraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 1.4 The genome, cell and development . . . . . . . . . . . . . . . . . . . . . . . . . . 4 1.5 Picking a measurement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 1.6 Isolating the cell . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 1.7 Combinatorial indexing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 1.8 Sample multiplexing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 1.9 Developmental atlases and spatial transcriptomics . . . . . . . . . . . . . . . . . . 9 1.10 Closing thought . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 Chapter 2: Massively multiplex chemical transcriptomics at single cell resolution . . . . 11 2.1 Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 2.2 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 2.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 2.3.1 Nuclear hashing enables multi-sample sci-RNA-seq . . . . . . . . . . . . . 13 2.3.2 sci-Plex enables multiplex chemical transcriptomics at single-cell resolution 14 2.3.3 sci-Plex scales to thousands of samples and enables HTS . . . . . . . . . . 17 2.3.4 Inference of chemical and mechanistic properties of HDAC inhibitors . . . 20 2.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24 2.5 Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25 2.6 Materials and Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25 2.6.1 Cell Culture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25 2.6.2 Compound Preparation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26 i 2.6.3 Drug treatment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26 2.6.4 CellTiter Glo . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 2.6.5 Cell counts of bosutinib exposed cells . . . . . . . . . . . . . . . . . . . . 27 2.6.6 Cancer cell line encyclopedia and connectivity map data and analysis . . . 27 2.6.7 Flow cytometry . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 2.6.8 Cell harvest, nuclei isolation and sample hashing . . . . . . . . . . . . . . 28 2.6.9 Preparation of sci-RNA-seq2 libraries . . . . . . . . . . . . . . . . . . . . 29 2.6.10 Preparation of sci-RNA-seq3 libraries . . . . . . . . . . . . . . . . . . . . 30 2.6.11 Preparation of bulk RNA sequencing libraries . . . . . . . . . . . . . . . . 32 2.6.12 Pre-processing of sequencing data . . . . . . . . . . . . . . . . . . . . . . 33 2.6.13 Assigning sample labels from hash reads . . . . . . . . . . . . . . . . . . 34 2.6.14 Dose-response analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35 2.6.15 Dimensionality reduction and trajectory analysis . . . . . . . . . . . . . . 35 2.6.16 Estimation of Proliferation Index . . . . . . . . . . . . . . . . . . . . . . . 36 2.6.17 Differential expression analysis . . . . . . . . . . . . . . . . . . . . . . . 36 2.6.18 Pairwise correlation of screened compounds . . . . . . . . . . . . . . . . . 37 2.6.19 Geneset enrichment analysis . . . . . . . . . . . . . . . . . . . . . . . . . 37 2.6.20 Alignment of HDAC inhibitior treated cells . . . . . . . . . . . . . . . . . 38 2.7 Supplemental Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38 Chapter 3: Embryo-scale, single cell spatial transcriptomics . . . . . . . . . . . . . . . 71 3.1 Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71 3.2 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72 3.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73 3.3.1 Spatial labeling of nuclei with oligonucleotide hashes . . . . . . . . . . . . 73 3.3.2 Spatially-resolved single cell sequencing of the mouse embryo . . . . . . . 74 3.3.3 Contrasting sci-Space and spatial transcriptomics methods relying on pat- terned arrays. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78 3.3.4 Spatial patterns of gene expression . . . . . . . . . . . . . . . . . . . . . . 79 3.3.5 Quantifying the explanatory power of spatial position . . . . . . . . . . . . 80 3.3.6 Pseudotemporal sci-Space trajectories reflect neuronal migration dynamics 84 3.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85 3.5 Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88 ii 3.6 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89 3.6.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89 3.6.2 Creating sci-Space grids for spatial barcoding . . . . . . . . . . . . . . . . 89 3.6.3 Testing blotting concentrations of hash oligos and SYBR green . . . . . . 90 3.6.4 Testing spotted space-grids . . . . . . . . . . . . . . . . . . . . . . . . . . 91 3.6.5 Spatial transfer of oligonucleotides and imaging . . . . . . . . . . . . . . . 92 3.6.6 Single cell RNA sequencing library preparation . . . . . . . . . . . . . . . 93 3.6.7 Pre-processing of sequencing data . . . . . . . . . . . . . . . . . . . . . . 96 3.6.8 Slide registration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97 3.6.9 Assigning spatial labels from hash reads . . . . . . . . . . . . . . . . . . . 98 3.6.10 Estimating nuclei counts from embryo images . . . . . . . . . . . . . . . . 100 3.6.11 Cell type classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100 3.6.12 Immunostaining and adjacent image alignment . . . . . . . . . . . . . . . 101 3.6.13 Anatomical annotation and segmentation . . . . . . . . . . . . . . . . . . 102 3.6.14 Tissue domains . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102 3.6.15 Analysis of aggregated spatial positions . . . . . . . . . . . . . . . . . . . 103 3.6.16 Kriging gene expression . . . . . . . . . . . . . . . . . . . . . . . . . . . 104 3.6.17 Spatial autocorrelation analysis . . . . . . . . . . . . . . . . . . . . . . . 104 3.6.18 RNA Fluorescence in situ Hybridization (FISH) and analysis . . . . . . . . 104 3.6.19 Variance decomposition model . . . . . . . . . . . . . . . . . . . . . . . . 106 3.6.20 Pairwise angular distance . . . . . . . . . . . . . . . . . . . . . . . . . . . 108 3.6.21 Spatial gene modules . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109 3.6.22 Neuronal trajectory analysis . . . . . . . . . . . . . . . . . . . . . . . . . 109 3.7 Supplemental Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110 Chapter 4: Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157 4.1 New models for a new era . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157 4.2 Harmonizing single cell genomics . . . . . . . . . . . . . . . . . . . . . . . . . . 158 4.3 Structure-function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158 iii iv LIST OF FIGURES Figure Number Page 2.1 sci-Plex uses polyadenylated single stranded oligonucleotides to label nuclei, en- abling cell hashing and doublet detection . . . . . . . . . . . . . . . . . . . . . . . 15 2.2 sci-Plex enables multiplex chemical transcriptomics at single cell resolution . . . . 16 2.3 sci-Plex enables global transcriptional profiling of thousands of chemical perturba- tions in a single experiment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 2.4 HDAC inhibitor trajectory captures cellular heterogeneity in drug response and biochemical affinity. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22 2.5 HDAC inhibitors shared transcriptional response indicative of acetyl CoA depriva- tion. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23 2.6 Hashing with short, polyadenylated single-stranded oligonucleotides enables sta- ble, low-cost labeling of nuclei for sci-RNA-seq and subsequent doublet detection . 39 2.7 sci-Plex distinguishes transcriptional responses of A549 cells to four small molecules and recovers dose-response estimates similar to established assays . . . . . . . . . 40 2.8 Dose-dependent differentially expressed genes (DEG) recover expected transcrip- tional modules. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41 2.9 Hash-based cell labeling in large-scale sci-Plex experiment. . . . . . . . . . . . . . 42 2.10 Quality control metrics for large-scale sci-Plex experiment. . . . . . . . . . . . . . 43 2.11 Exposing cells to compounds alters their distribution across cell clusters. . . . . . . 44 2.12 Exposing cells to compounds alters their distribution across cell clusters. . . . . . . 45 2.13 sci-Plex identifies pathway-specific enrichment of compounds across UMAP clusters. 46 2.14 Number of dose-dependent differentially expressed genes detected per compound category. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47 2.15 Correlation of “pseudobulk” sci-Plex with bulk-RNA-seq. . . . . . . . . . . . . . . 48 2.16 Moderated Z scores from the L1000 assay correlate with dose-dependent betas from sci-Plex . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49 2.17 Single cell measurements reveal variation in proliferation status in vehicle treated cell and across each dose of each drug. . . . . . . . . . . . . . . . . . . . . . . . . 50 v 2.18 Single cell measurements enable estimation of proliferation status and viability across drug-dose combinations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51 2.19 sci-Plex enables the dissection of proliferating and non-proliferating cell populations. 52 2.20 sci-Plex screen identifies viability and expression signatures that are reproducible across validation experiments and orthogonal datasets. . . . . . . . . . . . . . . . 53 2.21 Correlation of compound-driven molecular signatures for A549 cells identified in sci-Plex screen. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54 2.22 Correlation of compound-driven molecular signatures for K562 cells identified in sci-Plex screen. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55 2.23 Correlation of compound-driven molecular signatures for MCF7 cells identified in sci-Plex screen. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56 2.24 Clustergrams of the correlation of compound-driven molecular signatures. . . . . . 57 2.25 Pairwise distances between PCA embeddings of drugs based on their dose-dependent effects. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58 2.26 HDAC inhibitor-treated cell types align and enable joint pseudodose trajectory re- construction. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59 2.27 Linear models identify pseudodose-dependent modules of proliferation and metabolism. 60 2.28 HDAC inhibitor treatment induces cell cycle arrest in all three cell lines. . . . . . . 61 2.29 HDAC inhibitor exposure leads to sequestration of acetate in the form of acetylated lysines. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62 2.30 Pseudodose ridge plots. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63 2.31 Transcriptional trajectory of HDAC inhibitor-treated cells corresponds to in vitro IC50 measurements. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64 2.32 Supplementation with acetyl-CoA precursors decrease, while inhibition of enzymes that replenish acetyl-coA pools exacerbate, progression along the HDAC inhibitor pseudodose trajectory. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65 2.33 Correlation of effect sizes between differentially expressed genes post-HDAC in- hibition from original screen vs. new experiment. . . . . . . . . . . . . . . . . . . 66 2.34 Contact inhibition of cell proliferation 72 hours post drug exposure. . . . . . . . . 67 2.35 Aligning A549 cells at 24 and 72 hours after treatment reveals time-dependent responses to diverse small molecules. . . . . . . . . . . . . . . . . . . . . . . . . 68 2.36 Bromodomain inhibition, sirtuin activation, and histone deacetylase inhibition in- duce characteristic transcriptomic responses. . . . . . . . . . . . . . . . . . . . . . 69 2.37 The heterogeneous response to the majority of HDAC inhibitors does not appear to be driven by cellular asynchrony. . . . . . . . . . . . . . . . . . . . . . . . . . 70 vi 3.1 sci-Space recovers single cell transcriptomes while recording spatial coordinates . . 75 3.2 sci-Space captures spatially and cell type resolved gene expression across the embryo 77 3.3 Spatially restricted gene expression for developing neurons . . . . . . . . . . . . . 81 3.4 Quantifying and characterizing the variance in gene expression attributable to spa- tial position. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83 3.5 Pseudotemporal spatial trajectories capture migratory patterns in the developing brain. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86 3.6 Comparison of methods for spatial transcriptomics. . . . . . . . . . . . . . . . . . 110 3.7 Comparison of methods for spatial transcriptomics. . . . . . . . . . . . . . . . . . 111 3.8 SYBR green waypoints transfer to DAPI stained embryo. . . . . . . . . . . . . . . 112 3.9 Labeling of cryosectioned tissues with hash oligos from an agarose coated slide is compatible with sci-RNA-seq. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113 3.10 Spotted space-grids are reproducible. . . . . . . . . . . . . . . . . . . . . . . . . . 114 3.11 Spotted space-grids are reproducible. . . . . . . . . . . . . . . . . . . . . . . . . . 115 3.12 sci-Space workflow for sequencing library preparation and demultiplexing tran- scripts allows transcripts and spatial positions to be assigned to individual nuclei. . 116 3.13 Cellular hashing distinguishes low RNA UMI nuclei from aggregates and uniquely marks a nucleus’ position. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117 3.14 sci-Space sequenced cells have complex transcriptomes and separate into the major cell types. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118 3.15 Comparison of recovery of nuclei from sequencing and estimated nuclei present upon imaging. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119 3.16 Automated cell type annotation and concordance between methods. . . . . . . . . 120 3.17 DAPI stained images of sci-space sequenced slides. . . . . . . . . . . . . . . . . . 121 3.18 Co-registration procedure of imaged section and space-grid. . . . . . . . . . . . . 122 3.19 Spatial distribution of nuclei per slide in addition to key spatial metrics. . . . . . . 123 3.20 Segmented organs from each embryo section. . . . . . . . . . . . . . . . . . . . . 124 3.21 Alignment of immunostaining from adjacent cryosections. . . . . . . . . . . . . . 125 3.22 Spatial position of cell types for Slide 1 from embryo 1. . . . . . . . . . . . . . . . 126 3.23 Spatial position of cell types for Slide 4 from embryo 1. . . . . . . . . . . . . . . . 127 3.24 Spatial position of cell types for Slide 5 from embryo 1. . . . . . . . . . . . . . . . 128 3.25 Spatial position of cell types for Slide 6 from embryo 1. . . . . . . . . . . . . . . . 129 3.26 Spatial position of cell types for Slide 7 from embryo 1. . . . . . . . . . . . . . . . 130 3.27 Spatial position of cell types for Slide 8 from embryo 1. . . . . . . . . . . . . . . . 131 vii 3.28 Spatial position of cell types for Slide 9 from embryo 1. . . . . . . . . . . . . . . . 132 3.29 Spatial position of cell types for Slide 10 from embryo 1. . . . . . . . . . . . . . . 133 3.30 Spatial position of cell types for Slide 11 from embryo 1. . . . . . . . . . . . . . . 134 3.31 Spatial position of cell types for Slide 13 from embryo 1. . . . . . . . . . . . . . . 135 3.32 Spatial position of cell types for Slide 14 from embryo 1. . . . . . . . . . . . . . . 136 3.33 Spatial position of cell types for Slide 2 from embryo 2. . . . . . . . . . . . . . . . 137 3.34 Spatial position of cell types for Slide 3 from embryo 2. . . . . . . . . . . . . . . . 138 3.35 Cell types enriched in annotated anatomical segments. . . . . . . . . . . . . . . . 139 3.36 Tissue domains based on similar cell type compositions are found across the embryo.140 3.37 Comparison of traditional ISH and digital in-situs. . . . . . . . . . . . . . . . . . . 141 3.38 Comparison spatial single cell data and aggregated spatial data. . . . . . . . . . . . 142 3.39 Dissection labels transferred from a developing mouse brain atlas to cells in the sci-Space dataset. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143 3.40 Transfer of spatial labels from a developing mouse brain atlas to the sci-Space dataset – Slide 14. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144 3.41 Transfer of spatial labels from a developing mouse brain atlas to the sci-Space dataset – Slide 13. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145 3.42 Connective tissue progenitors are composed of multiple subtypes with distinct spa- tial distributions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146 3.43 Connective tissue progenitors are composed of multiple subtypes with distinct spa- tial distributions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147 3.44 Connective tissue progenitors are composed of multiple subtypes with distinct spa- tial distributions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148 3.45 Similarity between pairs of transcriptomes as a function of spatial distance. . . . . 149 3.46 Re-examining the effect of lineage on transcriptome in the developing C. elegans dataset using angular distance. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150 3.47 Variance explained by the cell type label, spatial label, or both. . . . . . . . . . . . 151 3.48 Variance explained by the cell type label, spatial label, or both. . . . . . . . . . . . 152 3.49 Subtypes of chondrocytes are spatially restricted. . . . . . . . . . . . . . . . . . . 153 3.50 Characteristics of neural pseudotemporal trajectory. . . . . . . . . . . . . . . . . . 154 3.51 Spatial location of cortical neural trajectories. . . . . . . . . . . . . . . . . . . . . 155 viii ACKNOWLEDGMENTS I am deeply grateful to the many people it took to make me the person and scientist I am today. I’ve trained at many institutions prior to coming to the University of Washington. I was first ex- posed to science at the Morehouse School of Medicine where I first learned how to do molecular biology experiments as a high schooler. At UC Berkeley in the lab of Steve Martin, Venice Chiueh gave me another chance to become work as a molecular biologist. There I learned how to perform cell culture, subclone plasmid DNA and test whether a candidate gene was oncogenic. During college summers, I was fortunate enough to be able to go home and work in Periasamy Selvaraj’s lab at Emory University. The 9 months of full time research I performed there were truly trans- formative. I learned how to think, how to read, and how to do experiments. Most importantly, I learned how to work. With a head filled with the stories of a younger Selva, I would work late into the night regularly performing western blots, purifying nanoparticles and performing mouse work. None of this could have been possible in the Selvaraj lab without the guidance of two amazing and caring graduate students Jaina Patel and Erica Bozeman. Each of them share with me a great deal of knowledge and provided me with friendship, despite the different stages of life we were in. After college I spent two years at the National Institutes of Health’s Vaccine Research Center (VRC) in the lab of Peter Kwong. There I worked primarily under Tongqing Zhou, who was the lead staff scientist in the lab and a brilliant crystallographer. Tongqing wasn’t easy to please and almost never gave positive feedback. I always thought that this was his way of caring, of being a tough parent, and demanding more from me. Aside from Tongqing, I also received mentorship from the many postdoctoral and staff scientists in Peter’s lab. Two particularly important mentors were Jason McLellan and Marie Pancera who still support me to this day. Beyond the scientific training I received doing protein crystallography at the VRC, one of the main things I walked away ix with was a vision for how science could change the world. The discussions at the VRC were at the cutting edge and had a lofty vision for how the science we did could change the world. My mentors at the University of Washington, Cole Trapnell and Jay Shendure have been phe- nomenal. I joined Cole’s lab first in the summer of 2015. Prior to my rotation with Cole, I had just read about single cell genomics and I knew I wanted to be a part of the resolution revolution that was about to take place. Cole had just joined as a new faculty the summer before, and I remember this because the lab was empty when I started. At the time, my thinking in joining Cole’s group was that I could learn from him, either through my work or osmosis, how one writes software for scientific computing. Although I feel no closer to this goal today, Cole allowed me to contribute my own strengths and natural talents to help further the group’s broader research vision. The free- dom, and trust that he has placed in me, allowed me to make the right mistakes and eventually find success. His enthusiasm and love for science are infectious and it is a pleasure to work with him everyday. Finally, Cole has a true appreciation for cool – an underrated quality in scientists. He knows how to tell a story and leave you wanting to be just like him. There are so many aspects of my current self that feel borrowed from Cole, but I don’t think I’ll ever give them back. Although my interactions with Jay were limited for the first few years, I remember stealing inspiration in the form of little quotes I would hear from him or about him. Over time our relation- ship has deepened, but I still find myself looking forward to the next bit of inspiration or insight that I’m going glean from him. Apart from Jay’s scientific brilliance, his kindness and dedication for treating people with respect are two of his most admirable qualities. This shows in the love he shows to trainees (both in his lab and outside of it) and to his family. Thanks for showing me that a scientific idol can also be a great person. Finally, I want to thank the friends, colleagues, and mentors who supported me at the University of Washington. Thank you to members of the UW MST program and Genome Sciences department for creating an inspiring environment to do science. The wood sculpture of the double helix, and the sequences of the human and mouse genomes still inspire me to this day. Maybe one day, I’ll x also be a part of such an endeavor. Certain individuals made my training complete. Jonathan Packer took time to teach me so much of the command line magic and statistics I know today. His grasp of very difficult concepts and the ability to explain them in many different ways are admirable and formed the bulk of my hands-on computational training in the lab. Riza Daza taught me everything I know about sequencing and library preparation. Her persistence and thorough explanations helped some of the more difficult concepts in genomics get through. Jose McFaline sat right next to me and provided a lot of support both as a mentor and a friend. He has a way of always saying the right things, and has a heart of gold. His capacity to teach and help seems to know no bounds. Lauren Saunders has been one of my closest friends in the lab. It’s hard to put a finger on why we get along so well, but if there’s anyone I would want to spend a day doing molecular biology with, its her. Vijay Ramani was an early influence in my development as a thinker in genomics and I owe him a large thanks for being a good friend. More broadly, all the members of the Shendure and Trapnell labs, past and present have influenced the way I think and contributed in unknown ways to the successes I’ve experienced. I also want to thank my committee members, Steven Henikoff, Stephen Tapscott, and Noah Simon for generously providing expertise, guidance, and perspective. Over the course of completing a dissertation there’s as much that happens outside the lab as inside the lab. Notably, Nick Hasle, Vanessa Gray, Greg Olson, Hugh Haddox, Una Nattermanns, Alex Ford, Ryan Volum, Sam Ainsley, Vineet Parkhe, Gwen Straley, Liza Severs and Matt Severs have made my time in Seattle absolutely sublime. Finally, I thank my family for their infinite understanding, love and support. My parents always emphasized the importance of education, I think that enrolling in a decade long educational pro- gram was the only logical end to that thinking. Lastly, I want to thank my wife Katya Cherukumilli who has been a bedrock of support over the last decade. She’s put up with a lot of late night and weekend shenanigans just so I could clone a plasmid one day faster or sequence an experiment xi one day sooner. She’s picked me up during my lows and helped bring me back to Earth during my highs. Her spirit, loving personality, and commitment to justice still serve as inspiration to me after all these years. xii DEDICATION This work is dedicated to my parents Srivatsan Ramachandran and Rajashree Srivatsan. Amma, you have given me endless, unconditional love. Appa, I hope you’ll think of this work partly as your own, you’ve always been my number one fan, supporter and cheerleader. xiii 1Chapter 1: INTRODUCTION 1.1 THE SEQUENCER AS A MOLECULAR COUNTER In the biological sciences we are in the midst of a great cataloguing effort. With ever-increasing resolution and precision, we are compiling massive amounts of data. This is most apparent in genomics, where the cost of DNA sequencing has fallen precipitously and greatly exceeded the rate of decrease observed by Moore’s law in computing (1). This rule of thumb was the observation that computing power doubled every 2 years for a fixed cost. This drop in computing costs has increased access to silicon-based technologies, which now impact nearly every sector, including transportation, telecommunication, warfare, business, consumer electronics, scientific research, and healthcare. If the biological sciences are to power a comparable paradigm shift, an important milestone will surely be the increased access to DNA sequencing. The completion of the human genome [by the Human Genome Project (HGP) (2) and the Telomere-to-Telomere (T2T) Project (3, 4)] and the genome sequences of other organisms span- ning the tree of life (e.g. mouse (5), chimpanzee (6), zebrafish (7)) furnished a set of scaffolds for mapping short genomic segments. Having the genome sequence as a reference scaffold meant that instead of sequencing long fragments of DNA that needed to be stitched together, shorter “chopped up” fragments could be conveniently sequenced and queried against the reference genome scaf- fold(8). This simple idea has led to a renaissance in the development of new sequencing technolo- gies and assays, which rely on using shorter DNA fragments as proxies for determining whether a particular DNA species was present (9). This thinking was quickly applied to study biology’s central dogma, which resulted in the development of RNA-sequencing (10, 11), ChIP-sequencing (12) and DNAse-seq (13–15). These assays, using short-read next generation DNA sequencing, respectively measure the production of RNAs, the binding of transcription factors to the genome, and the structure of chromatin. By biochemically converting each analyte into a sequencable frag- 2ment that could be mapped to the reference genome, the DNA sequencer was repurposed as a generic molecular counter (16). Put another way, any observation – organic or inorganic – that could be linked to a change in the abundance of a DNA/RNA species, could be measured by DNA sequencing. 1.2 THE NEW QUANTITATIVE BIOLOGIST Another consequence of sequencing genomes was that it laid bare the complexity of the biological algorithms contained within them. Prior to the completion of the human genome sequence, there was a widely held belief that it would explain how genes were regulated and elucidate the mutations that form the genetic basis for hereditary disease. This outlook was due in part, to a string of early successes, which seemed to indicate that specific inborn mutations could explain disease. Furthermore, many of these diseases (such as /al pha-, /beta-thalassemia, and sickle cell disease) exhibited a Mendelian pattern of inheritance – the disease state of an individual could be explained by the presence of dominant or recessive genetic sequences. Today, after sequencing hundreds of thousands of individuals, it is clear that the genetic and molecular basis for more common diseases such as Type 2 Diabetes Mellitus (T2DM) and Coronary Artery Disease (CAD) are not as straightforward. These studies indicate that mutations in a multitude of genes and in stretches of DNA sequence that regulate genes, each make small contributions to cumulative disease risk (17). To illustrate this point, the models that best predict the risk of cardiovascular events are trained on large amounts of genetic data collected at the population scale (18). These quantitative models estimate a polygenic risk score (PRS) by tabulating a weighted sum of different mutations con- tained across many sites in the genome. This risk score can then be used to stratify patients with respect to risk of heart disease. Notably, PRS are agnostic to molecular mechanisms and act purely as a blackbox to predict risk. The benefit of predicting a person’s disease risk is self-evident. Individuals with increased risk for a disease (e.g. coronary artery disease, type 2 diabetes) could receive counseling, custom dietary plans, recommendations for lifestyle modification, and prophylactic pharmaceutical inter- 3ventions. These interventions, all occurring prior to an individual’s trip to a hospital, could inform lifestyle choices and dramatically decrease human suffering associated with disease. This vision of linking genotypic data with phenotypic data (health outcomes) is currently being realized through large national efforts to integrate the healthcare system and population-scale sequencing efforts. For the task of predicting the risk of inheriting hereditary disease, the sequence of the genome is a useful abstraction; the inherited disease risk is materially propagated by the DNA that is stably inherited from generation to generation. Identifying the correct abstraction (or framing) for this problem has allowed for the unification of data collected asynchronously from human populations around the world – an effort that will have compounding rewards. Of note, the reference human genome is essential in this endeavor, because it provides the necessary scaffold. If other disciples are to emulate this effort, the importance of building the correct scaffold for data and information should be noted. 1.3 THE RIGHT LEVEL OF ABSTRACTION Another grand vision in biology is the use of biological parts like tissues and cells in engineering applications. The modeling task underlying this vision is complex. Such a model would have to predict the behavior of a cell (or a group of cells) given an initial state and a set of inputs. However, finding the correct abstraction and representation of the cell is not intuitive. How does one describe what a cell is? One could pose the question, “What does a cell do?” Alternatively, there is an equally compelling line of questioning, “What attributes does a cell have? What molecules does it contain?” These different frames can lead to wildly different representations (or encodings) of a cell. For example, if we were to take the route of describing the molecules contained within a cell, we could imagine modeling the cell at the atomic level. The cell would then be described as the set of molecular structures (atoms and their connectivities) and the locations of these structures within the cell. Correspondingly, a tissue would be a collection of these atomically-resolved cells, with interactions between the cells also resolved. Although this representation of the cell may 4be the most faithful to material reality, physics-based atomic simulations of much less complex systems, (e.g. protein structures or small molecules) have proven difficult to model, because they become computationally intractable for larger atomic assemblies and over meaningful time scales. Furthermore, the training data needed to inform and refine such a model does not exist, due to the technical limitations of current measurement modalities – highly multiplexed imaging, at atomic resolution, in situ (within the cell). Beyond the technical feasibility of an atomically resolved cell, it is unclear how this informa- tion would get encoded and aggregated. The insights from modeling genetic risk emphasize the importance of aggregating large amounts of data; a task that requires a unified scaffold. Second, such a model may be too complex for the cellular phenomena and behaviors that engineers and scientists are interested in harnessing. Building an atomic model of the cell would be akin to analysing the state of every transistor in a computer in order to reconstruct the operating system, programs, and data contained within it. Although this may be theoretically feasible, it would be the wrong way to approach the problem of figuring out what a computer is “doing.” 1.4 THE GENOME, CELL AND DEVELOPMENT Instead of simulating the atoms in the cell, a more practicable abstraction may be to consider the cell as the atomic unit. To draw this analogy out, atoms like Hydrogen and Helium have distinct chemical properties based on the number of protons contained within their nucleus. Analogously, we need comparable measurements of the cell that define certain cell classifications, and are linked to a cell’s behavior. To choose the measurement that would best describe the cell, cogitating a cell’s properties is helpful. In the most simplistic form a cell is a container that compartmentalizes molecules and biochemical reactions, establishing an inside and an outside. This compartmentalization allows the cell to achieve functions that are essential for its own maintenance and self-preservation. Further, as cells replicate they acquire heritable changes and the cells that are most fit for their environment gain an advantage relative to other cells in the population. This description of the cell captures 5the life cycle of prokaryotic cells (i.e. bacteria). To date bacteria have been thoroughly studied, effectively modeled, engineered, and deployed in the world. To model the dynamics and behavior of cells or tissues contained within multicellular organ- isms (like humans), it is also important to consider how these collections of cells came to be. Most multicellular organisms result from the fusion of two gametes (the sperm and the egg in animals). Upon creation of this new organism, this primal cell rapidly divides, going through an orchestrated process of cellular differentiation. Over the course of this process, the organism takes form. The instruction set to specify this process is fully contained within the genome. At the molecular level, cascading sets of genes are differentially expressed, imbuing sets of cells with different attributes and abilities. This internal program unfolding within each cell, is then coordinated with and af- fected by surrounding cells. This process eventually leads to formation of a conscious and sentient organism capable of understanding the surrounding world, adapting to pathogens it encounters, and rebuilding itself upon injury. 1.5 PICKING A MEASUREMENT In the execution of these dynamic programs, the genome plays a central role: it contains the blueprints to build cellular machinery, and it is the substrate upon which cellular computations are performed. In this cyclic loop, genes in the genome are expressed as RNA molecules, which are translated into proteins. Proteins act as the primary molecular machines present within the cell. They enable movement, facilitate cell division, sense the environment, and transduce the cell- relevant signals back to the genome for more computation. This property of proteins – the link between molecular and functional – means that the set of proteins expressed by a cell, could in principle, be a useful representation of that cell. Furthermore, properties include: • The set of proteins is finite • The same set of proteins is shared between cells • A protein’s presence and level of expression is tied to that cell’s function 6These properties make proteins an ideal candidate for measurement and representation of a cell. However, one missing component is the technical ability to make the measurement itself. To date few technologies have been able to faithfully measure the levels of all proteins in a single cell. When these measurements have been made, the scale of these experiments has been severely limited. Fortunately the levels of messenger RNA (mRNA), the protein’s molecular predecessor, corre- late well with the proteins they code for (19). Moreover, mRNA molecules can be converted into sequence-able DNA fragments for quantification by next generation sequencing. This means that the ever dropping cost of sequencing can be brought to bear on profiling cells by sequencing the mRNA they express. 1.6 ISOLATING THE CELL To measure the contents of a cell, the mRNA originating from each cell needs to be tallied. The earliest attempts to perform this assay involved the deposition of individual cells into individual wells of a microwell plate (20). Then, by preparing barcoded sequencing libraries from the mRNA in each well, the corresponding molecules from each cell could be sequenced and assembled into single cells. These early techniques including, MARS-seq (21), SMART-seq (22), demonstrated that single cell transcriptomes (the set of mRNA molecules present in a sample) could resolve individual cell types and characterize the cellular heterogeneity present within a collection of cells. A further increase in the scale of these reactions was accomplished by first segregating cells from one another by encapsulating each cell in an emulsion (23). These protocols, such as dropSeq (24) and inDrops (25), co-encapsulated the cell with a bead containing barcoded reverse transcrip- tion (RT) primers. Because each bead contains a unique RT barcode, all the RNA molecules from a given cell would share a DNA-barcode sequence. The increase in scale achieved by these meth- ods allowed for the measurement of many more single cell transcriptomes. This increase in scale allowed for the detection of rare and novel cell populations present in the sample. 71.7 COMBINATORIAL INDEXING Unlike the methods mentioned before, single cell sequencing by combinatorial indexing (sci-seq), is a method for sequencing single cell transcriptomes without their prior isolation in wells or emul- sions. Combinatorial indexing methods use a series of split-pool barcoding reactions to label co- transiting molecules. For single cell sequencing, the cell membrane or nuclear membrane act as the container that carries the DNA or RNA found within a cell, between barcoding reactions. The sci- seq paradigm has been adapted to capture a number of different molecular measurements includ- ing chromatin accessibility (sci-ATAC-seq) (26), gene expression (sci-RNA-seq) (27), methylation (sci-Met-seq) (28) and 3D genome conformation (sci-Hi-C) (29). Furthermore, because sci-seq assays label single cells via their unique transit through barcod- ing reactions, increasing the number of barcoding reactions at each round increases the space of possible barcodes exponentially. The upshot is that more cells can be sequenced without incurring too many multiplets – two cells that have traveled through the same combination of wells. For example in 2 level sci-RNA-seq experiments, unique barcodes are acquired through an indexed RT reaction followed by an indexed PCR reaction (27). If 96 barcode sequences are used at each level of indexing, there are a total of 9,216 possible unique transits. Adding a third set of barcoding reac- tions (3-level sci-RNA-seq) with 96 additional indices increases the total barcode space to 884,736 possible combinations. This exponential scale that can be achieved by 3-level sci-RNA-seq in the number of barcodes has allowed for the processing of millions of cells from a single experiment (30). The number of cells that can be profiled scales exponentially with the addition of more barcodes or levels of indexing. The sci-seq paradigm enables the profiling of millions of cells in a single experiment. While increasing the barcode space increases the number of single cells that can be resolved in a single experiment, there are limitations in the number of samples that can be processed simultaneously. Currently, each sample must be processed individually and loaded at an equal concentration into the first, barcoded, combinatorial indexing reaction. Using this method 10s to 100s of samples have been processed in parallel at great cost, labor, and experimental complexity. 81.8 SAMPLE MULTIPLEXING Our ability to measure cells quickly exceeded our ability to make experimental contrasts with a single cell readout. To build accurate models of cellular systems, the data provided to train such a model must derive from diverse experimental conditions. To acquire this data, we effectively need ways to read out results from many perturbations with a highly resolved molecular readout. To tackle this problem, several groups have developed ‘cellular hashing’ methods, wherein cells from different samples are labeled and mixed prior to scRNA-seq (31–33). Although this greatly reduces per-sample costs, these hashing approaches require relatively expensive reagents (e.g. antibodies or chemically modified DNA oligos), use cell type-dependent protocols, and employ scRNA-seq platforms with a high per-cell cost. To address these shortcomings, and enable highly multiplexed experiments within a sci-seq framework I developed a hashing strategy (see Chapter 2) that uses unmodified oligonucleotides to label nuclei. 1.9 DEVELOPMENTAL ATLASES AND SPATIAL TRANSCRIPTOMICS Sample multiplexing allows for the measurement of a cell’s response consequent to chemical, genetic or other environmental perturbations. Although these perturbations may capture a cell’s response to a drug or a genetic lesion, they don’t provide insight into the full range of normal cellular states and transitions made by cells. To catalogue the complete set of cellular states, many single cell biologists have turned to study the developing organism. By documenting the set of cellular states and the transitions between them, the goal is to understand the molecular drivers underlying these cell state transitions. To accomplish this feat, multiple groups have performed whole organism sequencing at single cell resolution (27, 30, 34–37), effectively profiling the diverse cellular populations and cellular states present over the course of development. These cellular states and the transitions between them, establish the natural progression of cells in an organism as it develops (38). However, thus far these datasets have been captured on disassociated or disaggregated cells. This means that there is no retention of spatial information in the assay. While there are emerging computational methods 9and cutting edge technologies that attempt to address this problem, very few technologies spatially resolve the single cell transcriptome. As an extension of my work in sample multiplexing, in Chapter 3 I describe the development of an assay called sci-Space, a spatial barcoding technology that is enabled by oligo hashing. Importantly, sci-Space can scale to the same extent as sci-RNA- seq. 1.10 CLOSING THOUGHT To draw a final parallel to the genome projects, single cell biologists can currently profile cells in a shotgun manner akin to the effort of Celera Genomics and Craig Venter (2), however, the definitive scaffold need to interpret this data –the public human genome project – is still missing. The form of such a scaffold and the technologies used to construct it are being developed now. 10 11 Chapter 2: MASSIVELY MULTIPLEX CHEMICAL TRANSCRIP- TOMICS AT SINGLE CELL RESOLUTION Chapter 2 is adapted with minimal modification from: Sanjay R Srivatsan*, Jose´ L. McFaline-Figueroa*, Vijay Ramani*, Lauren Saunders, Junyue Cao, Jonathan Packer, Hannah A. Pliner, Dana L. Jackson, Riza M. Daza, Lena Christiansen, Fan Zhang, Frank Steemers, Jay Shendure, Cole Trapnell (2020) Massively multiplex chemical transcriptomics at single cell resolution. Science 06 January 2020. I contributed to all aspects of the paper, including the conception of the project, performance of the experiments, analysis of the data, and writing the initial manuscript draft. 2.1 ABSTRACT High-throughput chemical screens typically employ coarse assays, e.g. cell survival, limiting what can be learned about mechanisms of action, off-target effects, and heterogeneous responses. Here we introduce sci-Plex, which uses ‘nuclear hashing’ to quantify global transcriptional responses to thousands of independent perturbations at single-cell resolution. As a proof-of-concept, we applied sci-Plex to screen 3 cancer cell lines exposed to 188 compounds. In total, we profiled 650,000 single-cell transcriptomes across 5,000 independent samples in one experiment. Our results re- veal substantial intercellular heterogeneity in response to specific compounds, commonalities in response to families of compounds, and insight into differential properties within families. In par- ticular, our results with HDAC inhibitors support the view that chromatin acts as an important reservoir of acetate in cancer cells. 2.2 INTRODUCTION High-throughput screens (HTS) are a cornerstone of the pharmaceutical drug discovery pipeline (39, 40). Conventional HTS have at least two major limitations. First, the readout of most HTS are restricted to gross cellular phenotypes, e.g. proliferation (41, 42), morphology (43, 44), or a 12 highly specific molecular readout (45, 46). Subtle changes in cell state or gene expression, that might otherwise provide mechanistic insights or reveal off-target effects, are routinely missed. Second, even when HTS are performed in conjunction with more comprehensive molecular phenotyping such as transcriptional profiling (47–50), a limitation of bulk assays is that even cells ostensibly of the same ‘type’ can exhibit heterogeneous responses (51, 52). Such cellular hetero- geneity can be highly relevant in vivo. For example, it remains largely unknown whether the rare subpopulations of cells that survive chemotherapeutics are doing so on the basis of their genetic background, epigenetic state, or some other aspect (53, 54). In principle, single-cell transcriptome sequencing (scRNA-seq) represents a form of high- content molecular phenotyping that could enable HTS to overcome both limitations. However, the per-sample and per-cell costs of most scRNA-seq technologies remain high, precluding even modestly sized screens. Recently, several groups developed ‘cellular hashing’ methods, wherein cells from different samples are molecularly labeled and mixed prior to scRNA-seq. However, cur- rent hashing approaches require relatively expensive reagents (e.g. antibodies (55) or chemically modified DNA oligos (56, 57)), use cell type-dependent protocols (58), and/or employ scRNA-seq platforms with a high per-cell cost. To enable cost-effective HTS with scRNA-seq-based phenotyping, we describe a novel sam- ple labeling (hashing) strategy that relies on labeling nuclei with unmodified single-stranded DNA oligos. Recent improvements in single cell combinatorial indexing (sci-RNA-seq3) have lowered the cost of scRNA-seq library preparation to less than $0.01 per cell, with millions of cells profiled per experiment (59). Here we combine nuclear hashing and sci-RNA-seq into a single workflow for multiplex transcriptomics, termed ‘sci-Plex’. As a proof-of-concept, we apply sci-Plex to per- form a HTS of 3 cancer cell lines, profiling thousands of independent perturbations in a single experiment. We further explore how chemical transcriptomics at single cell resolution can shed light on mechanisms of action. Most notably, we find that gene regulatory changes consequent to treatment with HDAC inhibitors are consistent with the model that they interfere with proliferation by restricting a cell’s ability to draw acetate from chromatin (60, 61). 13 2.3 RESULTS 2.3.1 Nuclear hashing enables multi-sample sci-RNA-seq Single-cell combinatorial indexing (sci-) methods use split-pool barcoding to uniquely label the molecular contents of large numbers of single cells or nuclei (62). Samples can be barcoded by these same indices, e.g. by placing each sample in its own well during reverse transcription in sci- RNA-seq (59, 63), but such enzymatic labeling at the scale of thousands of samples is operationally infeasible and cost-prohibitive. To enable single-cell molecular profiling of a large number of independent samples within a single sci- experiment, we set out to develop a low-cost labeling procedure. We noticed that single-stranded DNA (ssDNA) specifically stained the nuclei of permeabilized cells, but not intact cells (Figure 3.1A; Figure 3.6A). We therefore postulated that a polyadeny- lated ssDNA oligonucleotide could be used to label populations of nuclei in a manner compatible with sci-RNA-seq (Figure 3.1B; Figure 3.6B). To test this concept, we performed a ‘barnyard’ experiment. Human (HEK293T) and mouse cells (NIH3T3) were each separately seeded to 48 wells of a 96-well culture plate. We performed nuclear lysis in the presence of 96 well-specific polyadenylated ssDNA oligos (‘hash oligos’) and fixed the resulting nuclear suspensions with paraformaldehyde. Having labeled or ‘hashed’ the nuclei with a molecular barcode, we pooled nuclei and performed a 2-level sci-RNA-seq experiment. Because the hash oligos were polyadeny- lated, they had the potential to be combinatorially indexed identically to endogenous mRNAs. As intended, we recovered reads corresponding to both endogenous mRNAs (median 4,740 unique molecular identifiers (UMIs) per cell) and hash oligos (median 270 UMIs per cell). We devised a statistical framework to identify the hash oligos associated with each cell at a frequency exceeding background. We observed 99.1% concordance between species assignments based on hash oligos vs. endogenous cellular transcriptomes (Figure 3.1C; Figure 3.6C-F). Addi- tionally, the association of hash oligos and nuclei was stable to a freeze-thaw cycle, highlighting the opportunity to label and store samples (Figure 3.1D; Figure 3.6G,H). These results demonstrate that hash oligos stably label nuclei in a manner that is compatible with sci-RNA-seq. 14 In sci- experiments, ‘collisions’ are instances in which two or more cells are labeled with the same combination of barcodes by chance (62). To evaluate hashing as a means of detecting doublets resulting from collisions, we varied the number of nuclei loaded per PCR well, resulting in a range of predicted collision rates (7-23%) that was well matched by observation (Figure 3.6I). Hash oligos facilitated the identification of the vast majority of interspecies doublets (95.5%) and otherwise undetectable within-species doublets (Figure 3.1E; Figure 3.6J,K). 2.3.2 sci-Plex enables multiplex chemical transcriptomics at single-cell resolution We next evaluated whether nuclear hashing could enable chemical screens, by labeling cells that had undergone a specific perturbation, followed by single-cell transcriptional profiling as a high- content phenotypic assay. We exposed A549, a human lung adenocarcinoma cell line, to one of four compounds dexamethasone (corticosteroid agonist), nutlin-3a (p53-Mdm2 antagonist), BMS- 345541 (inhibitor of NF-κB-dependent transcription), or vorinostat/SAHA (histone deacetylase inhibitor) for 24 hours, across 7 doses in triplicate for a total of 84 drug/dose/replicate combinations and additional vehicle controls (Figure 3.2A; Figure 3.7A). Nuclei from each well were labeled and subjected to sci-RNA-seq2 (Figure 3.7B-D). We used Monocle 3 (59) to visualize these data via Uniform Manifold Approximation and Projection (64) (UMAP) and louvain community detection to identify compound-specific clusters of cells, which were distributed in a dose-dependent manner (Figure 3.2B,C; Figure 3.7E,F). To quantify the “population average” transcriptional response of A549 to each of the four drugs, we modeled each gene’s expression as a function of dose through generalized linear regression. 7,561 genes were sensitive to at least one drug, and 3,189 genes were differentially expressed in response to multiple drugs (Figure 3.8A). These included canonical targets of dexamethasone (Figure 3.2D) and nutlin-3a (Figure 3.2E). Gene ontology analysis of differentially expressed genes revealed the involvement of drug-specific pathways (e.g. hormone signaling for dexamethasone; p53 signaling for nutlin-3a; Figure 3.8B). Additionally, we evaluated whether the number of cells recovered at each concentration could be used to infer toxicity akin to traditional screens. After fitting a 15 Figure 2.1: sci-Plex uses polyadenylated single stranded oligonucleotides to label nuclei, enabling cell hashing and doublet detection. (A) Fluorescent images of permeabilized nuclei after incubation with DAPI (top) and an Alexa-647 conjugated single stranded oligonucleotide (bottom). (B) Overview of sci- Plex. Cells corresponding to different perturbations are lysed in-well, their nuclei labeled with well-specific “hash” oligos, followed by fixation, pooling and sci-RNA-seq. (C) Scatter plot depicting the number of unique molecular identifiers (UMIs) from single cell transcriptomes derived from a mixture of hashed human HEK293T cells and murine NIH3T3 cells. Points colored based on hash oligo assignment. (D) Boxplot depicting the number of mRNA UMIs recovered per cell for fresh vs. frozen human and mouse cell lines. (E) Scatter plot of overloading experiment; axes as in panel C. Identified Hash oligo collisions (red) identify cellular collisions with high sensitivity. response curve to the recovered cellular counts, we inferred a ‘viability score’ from sci-Plex data, a metric which was concordant with gold standard measurements (Figure 3.2F;Figure 3.7G-I). 16 Figure 2.2: sci-Plex enables multiplex chemical transcriptomics at single cell resolution. (A) Cartoon representation depicting compounds and corresponding targets assayed within the pilot sci-Plex experiment. A549 lung adenocarcinoma cells were treated with either vehicle (DMSO or ethanol) or one of four com- pounds (BMS345541, dexamethasone, nutlin-3a or SAHA). (B) UMAP embedding of chemically perturbed A549 cells colored by drug treatment. (C) UMAP embedding of chemically perturbed A549 cells faceted by treatment with cells colored by dose. (D,E) Expression of a canonical (D) glucocorticoid receptor activated (ANGPTL4) and repressed (GDF15) target genes as a function of dexamethasone dose, or (E) p53 target genes as a function of nutlin-3a dose. Y-axes indicate percentage of cells with at least one read correspond- ing to the transcript. (F) Dose-response viability estimates for BMS345541, dexamethasone, nutlin-3a and SAHA-treated A549 cells, based on the relative number of cells recovered at each dose. 17 2.3.3 sci-Plex scales to thousands of samples and enables HTS To assess how sci-Plex scales for HTS, we performed a screen of 188 compounds targeting a di- verse range of enzymes and molecular pathways (Figure 3.3A). Half of this panel was chosen to target transcriptional and epigenetic regulators. The other half was chosen to sample diverse mechanisms of action. We exposed three well-characterized human cancer cell lines (A549 (lung adenocarcinoma), K562 (chronic myelogenous leukemia), MCF7 (mammary adenocarcinoma)) to each of these 188 compounds at four doses (10 nM, 100 nM, 1µ M, 10 µM) in duplicate, ran- domizing compounds and doses across well positions in replicate culture plates. These conditions, together with vehicle controls, accounted for 4,608 of 4,992 independently treated cell populations in this experiment. After treatment we lysed cells to expose nuclei, hashed them with a unique combination of two oligos (Figure 3.9A), and performed sci-RNA-seq3 (59). After sequencing and filtering based on hash purity (Figure 3.9B-F), we obtained transcriptomes for 649,340 single cells, with median mRNA UMI counts of 1,271, 1,071, and 2,407 for A549, K562 and MCF7, respectively (Figure 3.10A). The aggregate expression profiles for each cell type were highly con- cordant between replicate wells (Pearson correlation = 0.99) (Figure 3.10B). Visualizing sci-RNA-seq profiles separately for each cell line revealed compound-specific tran- scriptional responses and patterns that were common to multiple compounds. For each of the cell lines, UMAP projected most cells into a central mass, flanked by smaller clusters (Figure 3.3B). These smaller clusters were largely comprised of cells treated with compounds from only one or two compound classes (Figure 3.11; Figure 3.12A-C). For example, A549 cells treated with tri- amcinolone acetonide, a synthetic GR agonist, were markedly enriched in one such small cluster, comprising 95% of its cells (Fisher’s exact test, FDR < 1%, Figure 3.12D,E). Although many drugs were associated with a seemingly homogenous transcriptional response, we also identified cases in which distinct transcriptional states were induced by the same drug. For example, in A549, the microtubule stabilizing compounds epothilone A and epothilone B were associated with three such focal enrichments, each comprised of cells from both compounds at all 4 doses (Fig- ure 3.12F,G). The cells in each focus were distinct from one another, but transcriptionally similar to 18 other treatments – either a recently identified microtubule destabilizer, rigosertib (65), the SETD8 inhibitor UNC0397, or untreated proliferating cells (Figure 3.12H). We next assessed the effects of each drug on the ’population average’ transcriptome of each cell line. In total, 6,238 genes were differentially expressed in a dose-dependent manner in at least one cell line (FDR < 5%; Figure 3.13;). Bulk RNA-seq measurements collected for 5 compounds, across 4 doses and vehicle agreed with averaged gene expression values and estimated effect sizes across identically treated single cells, although correlations between small effect sizes were dimin- ished (Figure 3.14). Moreover, sci-Plex dose-dependent effect profiles correlated with compound matched L1000 measurements (49) (Figure 3.15). Genes associated with the cell cycle were highly variable across individual cells, and many drugs reduced the fraction of cells that expressed proliferation marker genes (Figure 3.15-Figure 3.17). In principle, scRNA-seq should be able to distinguish shifts in the proportion of cells in distinct transcriptional states from gene regulatory changes within those states. In contrast, bulk transcrip- tome profiling would confound these two signals (Figure 3.18A) (52). We therefore tested for dose-dependent differential expression on subsets of cells corresponding to the same drug but ex- pressing high vs. low levels of proliferation marker genes (Figure 3.18B). Correlation between the dose-dependent effects on the two fractions of each cell type varied across drug classes (Fig- ure 3.18C), with some frankly discordant effects for individual compounds (Figure 3.18D). Viabil- ity analysis, performed as in the pilot experiment, revealed that after drug exposure at the highest dose, only 52 (27%) compounds caused a drop in viability of 50% or more (Figure 3.3C; Fig- ure 3.10C). Amongst the drugs that reduced viability, we observed a higher sensitivity of K562 to the Src/Abl inhibitor bosutinib (Figure 3.3C), a result we confirmed via cell counting (Fig- ure 3.19A). This result is consistent with K562 cells harboring a constitutively active BCR-ABL fusion kinase (66) and an observed increased sensitivity of hematopoietic and lymphoid cancer cell lines to Abl inhibitors (67) (Figure 3.19B). To assess whether each compound elicited similar responses across the three cell lines, we clustered compounds using the effect sizes for dose-dependent genes as loadings in each cell line (Figure 3.20-Figure 3.35). Joint analysis of the three cell lines revealed common and cell type 19 Figure 2.3: sci-Plex enables global transcriptional profiling of thousands of chemical perturbations in a single experiment. (A) Schematic of the large-scale sci-Plex experiment (sci-RNA-seq3). 188 small molecules were tested for their effects on A549, K562 and MCF7 human cell lines, each at 4 doses and in biological replicate, after 24 hours of treatment. The plate positions of doses and drugs were varied between replicates and a median of 100 to 200 cells were recovered per condition. (B) UMAP embeddings of A549, K562 and MCF7 cells in our screen with each cell colored by the pathway targeted by the compound to which a given cell was exposed. To facilitate visualization of significant molecular phenotypes, we added transparency to cells treated with compound/dose combinations that did not appreciably alter the correspond- ing cells’ distribution in UMAP space as compared to vehicle controls (Fisher’s exact test, FDR < 1%). (C) Viability estimates obtained from hash-based counts of nuclei at each dose of selected compounds (bosu- tinib highlighted in red text). Rows represent compound doses increasing from top to bottom and columns represent individual compounds. Annotation bar at top depicts the broad cellular activity targeted by each compound. (D) UMAP embeddings highlighted by treatment with the MEK inhibitor trametinib (red), an HSP90 inhibitor (purple), or vehicle control (gray). (E) HSP90AA1 expression levels in cells exposed to increasing doses of trametinib. Y-axes indicate percentage of cells with at least one read corresponding to the transcript. specific responses to different compounds (Figure 3.36-Figure 3.37). For example, trametinib, a MEK inhibitor, induced a transcriptionally distinct response in MCF7. Inspection of UMAP pro- 20 jections revealed trametinib-treated MCF7 interspersed amongst vehicle controls, reflecting lim- ited effects. In contrast, trametinib-treated A549 and K562 cells, which harbor activating KRAS and ABL mutations (68), respectively, were tightly clustered, consistent with a strong, specific transcriptional response to inhibition of MEK signaling by trametenib (Figure 3.3D). Further, we observed that these A549 and K562 cells appeared proximal to clusters enriched with inhibitors of HSP90, a key chaperone for protein folding (Figure 3.3D). This observation was corroborated by concordant changes in HSP90AA1 expression in Trametinib-treated cells (Figure 3.3E). Analysis of Connectivity Map data (49, 50) revealed further evidence that MEK inhibitors do indeed induce highly similar gene expression signatures to HSP90 perturbations (Figure 3.19C) especially in A549 but not in MCF7 (Figure 3.19D-E). These results are concordant with previous observations of the regulation of HSP90AA1 downstream of MEK signaling (69) and suggests that similarity in single-cell transcriptomes treated with distinct compounds can highlight drugs that target conver- gent molecular pathways. 2.3.4 Inference of chemical and mechanistic properties of HDAC inhibitors For each of the three cell lines, the most prominent compound response was comprised of cells treated with one of seventeen histone deacetylase (HDAC) inhibitors (dark blue in Figure 3.3B;). To assess the similarity of the dose-response trajectories between cell lines, we aligned HDAC- treated cells and vehicle-treated cells from all three cell lines using a mutual-nearest neighbor (MNN) matching approach (70) to produce a consensus HDACi trajectory, which we term ’pseudo- dose’ (analogous to pseudotime (71)) (Figure 3.4B; Figure 3.38). We observed that some HDAC inhibitors induced homogeneous responses, with nearly all cells localized to a relatively narrow range of the HDACi trajectory at each dose (e.g. pracinostat in A549), while other drugs induced much greater cellular heterogeneity (Figure 3.4B; Figure 3.39). Such heterogeneity could be explained by cells executing a defined transcriptional program asynchronously, with the dose of drug that the cells are exposed to modulating the rates of their progression through it. To test this hypothesis, we sequenced the transcriptomes of 64,440 A549 21 cells that were treated for 72 hours with one of 48 compounds, including many of the HDAC inhibitors from the large sci-Plex screen. Upon accounting for confluency dependent cell cycle effects and MNN alignment (Figure 3.40-Figure 3.42), the co-embedded UMAP projection re- vealed new focal concentrations of cells at 72 hours that were not evident at the 24 hour time point, e.g. SRT1024 (Figure 3.43). However, for the majority of HDAC inhibitors tested, we did not observe that cells at a given dose moved farther along an aligned HDAC trajectory at 72 hours (Figure 3.44). This suggests that the dose of many HDAC inhibitors governs the magnitude of a cell’s response rather than its rate of progression and that any observed heterogeneity cannot be attributed solely to asynchrony (Figure 3.44). Next, we assessed whether a given HDAC inhibitor’s target affinity explained its global tran- scriptional response to the compound. We used dose-response models to estimate each compound’s ‘transcriptional EC50 (TC50)’, i.e. the concentration needed to drive a cell halfway across the HDACi pseudodose trajectory (Figure 3.45A). To compare the transcriptionally-derived measures of potency with the biochemical properties of each compound, we collected published IC50 val- ues for each compound from in vitro assays performed on 8 purified HDAC isoforms. With the exception of 2 relatively insoluble compounds, our calculated TC50 values increased as a function of compound IC50 values (Figure 4C; Figure 3.45B-C). To assess the components of the HDAC inhibitor trajectory, we performed differential expres- sion analysis using pseudodose as a continuous covariate. Of the 4,308 genes that were signif- icantly differentially expressed over this consensus trajectory, 2,081 (48%) responded in a cell type-dependent manner, while 942 (22%) exhibited the same pattern in all three cell lines (Fig- ure 3.46A,B). One prominent pattern shared by the three cell lines was an enrichment for genes and pathways indicative of progression towards cell cycle arrest (Figure 3.46C; Figure 3.47A,B). DNA content staining and flow cytometry confirmed that HDAC inhibition resulted in the accumu- lation of cells in the G2/M phase of the cell cycle (72) (Figure 3.47C,D). The shared response to HDAC inhibition included not only cell cycle arrest, but also the altered expression of genes involved in cellular metabolism (Figure 3.46C). Histone acetyltransferases and deacetylases regulate chromatin accessibility and transcription factor activity through the ad- 22 Figure 2.4: HDAC inhibitor trajectory captures cellular heterogeneity in drug response and biochem- ical affinity. (A) Nearest-neighbor alignment and UMAP embedding of transcriptional profiles of cells treated with one of seventeen HDAC inhibitors. Pseudodose root displayed as a red dot. (B) Ridge plots displaying the distribution of cells along pseudodose by dose. Shown for 3 HDAC inhibitors with varying biochemical affinities. (C) Relationship between log transcriptional EC50 (TC50) and average log10(IC50) from in vitro measurements. Asterisks indicate compounds with a solubility below 200 mM (in DMSO) that were not included in the fit. dition or removal of charged acetyl groups (73–75). Acetate, the product of HDAC class I-,II- and IV-mediated histone deacetylation and a precursor to acetyl-CoA, is required for histone acety- lation but also has important roles in metabolic homeostasis (61, 76, 77). Inhibition of nuclear deacetylation limits recycling of chromatin-bound acetyl groups for both catabolic and anabolic processes (77). Accordingly, we observed that HDAC inhibition led to sequestration of acetate in the form of markedly increased acetylated lysine levels after exposure to a 10µM dose of the HDAC inhibitors pracinostat and abexinostat (Figure 3.48). Upon further inspection of pseudodose-dependent genes, we observed that enzymes critical for cytoplasmic acetyl-CoA synthesis from either citrate (ACLY) or acetate (ACSS2) were upregulated (Figure 3.5A). Genes involved in cytoplasmic citrate homeostasis (GLS, IDH1, and ACO1), cit- rate cellular import (SLC13A3) and mitochondrial citrate production and export (CS, SLC25A1) were also upregulated. Upregulation of SIRT2, which deacetylates tubulin, was also observed in response to HDAC inhibition. Together with increases in chromatin-bound acetate, these transcriptional responses suggest 23 a metabolically consequential depletion of cellular acetyl-CoA reserves in HDAC-inhibited cells (Figure 3.5B). To validate this further, we sought to shift the distribution of cells along the HDAC inhibitor trajectory by modulating cellular acetyl-CoA levels. We treated A549 and MCF7 cells with pracinostat in the presence and absence of acetyl-CoA precursors (acetate, pyruvate or citrate) or inhibitors to enzymes (ACLY, ACSS2, or PDH) involved in replenishing acetyl-CoA pools. Af- ter treatment, cells were harvested and processed via sci-Plex and trajectories constructed for each cell line (Figure 3.49-Figure 3.50). In both A549 and MCF7, acetate, pyruvate and citrate supple- mentation were capable of blocking pracinostat-treated cells from reaching the end of the HDACi trajectory (Figure 3.49F,J,H,L). In MCF7, both ACLY and ACSS2 inhibition shifted cells further along the HDACi trajectory, although no such shift was observed in A549 (Figure 3.49G,K,I,M). Taken together, these results suggest that a major feature of the response of cells to HDAC in- hibitors, and possibly their associated toxicity, is the induction of an acetyl-CoA deprived state. Figure 2.5: HDAC inhibitors shared transcriptional response indicative of acetyl CoA deprivation. (A) Heatmap of row-centered and z-scaled gene expression depicting the upregulation of pseudodose-dependent genes involved in cellular carbon metabolism. (B) Cartoon of the roles of genes from panel A in cytoplasmic acetyl-CoA regulation. Red circles indicate acetyl groups. Enzymes shown in gray. Transporters shown in green (FA = fatty acid; Ac-CoA = Acetyl-CoA; C = Citrate). 24 2.4 DISCUSSION Here we present sci-Plex, a massively multiplex platform for single-cell transcriptomics. sci- Plex uses chemical fixation to cost-effectively and irreversibly label nuclei with short, unmodi- fied single-stranded DNA oligos. In the proof-of-concept experiment described here, we applied sci-Plex to quantify the dose-dependent responses of cancer cells to 188 compounds through an assay that is both high content (global transcription) and high resolution (single cell). By profiling several distinct cancer cell lines, we distinguished between shared and cell line-specific molecular responses to each compound. sci-Plex offers some unique advantages over conventional HTS: sci-Plex can distinguish a com- pound’s distinct effects on cellular subsets (including complex in vitro systems e.g. cellular repro- gramming, organoids, synthetic embryos); it can unmask heterogeneity in cellular response to a perturbation; and it can measure how drugs shift the relative proportions of transcriptionally dis- tinct subsets of cells. Highlighting these features, our study provides insight into the mechanism of action of HDAC inhibitors. Specifically, we find that the main transcriptional responses to HDAC inhibitors involve cell cycle arrest and marked shifts in genes related to acetyl-CoA metabolism. For some HDAC inhibitors, we observe clear heterogeneity in responses observed at the single cell level. Although HDAC inhibition is conventionally thought to act through mechanisms directly in- volving chromatin regulation, our data support an alternative model, albeit not a mutually exclusive one, wherein HDAC inhibitors impair growth and proliferation by interfering with a cancer cell’s ability to draw acetate from chromatin (60, 61, 77). As such, variation in cells’ acetate reservoirs is a potential explanation for their heterogeneous responses to HDAC inhibitors. As the cost of single cell sequencing continues to fall, the opportunities for leveraging sci-Plex for basic and applied goals in biomedicine may be substantial. The proof-of-concept experiments described here – nearly 5,000 independent treatments, transcriptional profiling of over 100 single cells per treatment – can potentially be scaled towards a comprehensive, high-resolution atlas of cellular responses to pharmacologic perturbations (e.g. hundreds of cell lines or genetic back- grounds, thousands of compounds, multi-channel single cell profiling, etc.). The ease and low cost 25 of oligo hashing, coupled with the flexibility and exponential scalability of single cell combinato- rial indexing, would facilitate this goal. 2.5 ACKNOWLEDGMENTS We thank members of the Shendure lab, Trapnell lab and others, particularly A. Adey, G. Booth, A. Hill, S. Henikoff, K. Cherukumilli, P. Selvaraj, and T. Zhou for helpful suggestions, discussion and mentorship. D. Prunkard and A. Leith for assistance in flow sorting. Funding: This work was funded by grants from the NIH (DP1HG007811 and R01HG006283 to J.S.; DP2 HD088158 to C.T.), the W. M. Keck Foundation (to C.T. and J.S.), the NSF (DGE-1258485 to S.S). The Paul G. Allen Frontiers Group (to J.S. and C.T.). J.S. is an Investigator of the Howard Hughes Medical Institute. L.C. , FZ, and F.J.S. declare competing financial interests in the form of stock ownership and paid employment by Illumina, Inc. Author contributions: S.R.S., J.L.M., V.R., J.S., C.T. conceived the project; S.R.S., J.L.M., V.R. designed experiments; F.Z, L.C, F.S provided reagents and performed sequencing; S.R.S., J.L.M., C.T. analyzed the data; S.R.S., J.L.M., V.R., J.S., C.T. wrote the manuscript. Competing interests: One or more embodiments of one or more patents and patent applications filed by Illumina and University of Washington may encompass the methods, reagents, and data disclosed in this manuscript. Data and materials availability: Processed and raw data can be downloaded from NCBI GEO (#GSE139944). Code used to perform analyses can be accessed on Zenodo and https://github.com/cole-trapnell-lab/sci-plex. 2.6 MATERIALS AND METHODS 2.6.1 Cell Culture A549 cells and K562 cells were a kind gift from Dr. Robert Bradley (UW) and Dr. David Hawkins (UW), respectively. MCF7 (cat no. HTB-22), NIH3T3 (cat no. CRL-1658) and HEK293T (cat no. CRL-11268) cells were purchased from ATCC. A549 and MCF7 cells were cultured in DMEM (ThermoFisher, 11995073) media supplemented with 10% FBS (ThermoFisher, cat no. 26140079) and 1% penicillin-streptomycin (ThermoFisher, 15140122). K562 cells were cultured in RPMI 26 1640 (Fisher Scientific, cat no. 11-875-119) supplemented with 10% FBS and 1% penicillin- streptomycin and maintained between 0.2-1 x 106 cells/ml. All cells were cultured at 37C with 5% CO2. Adherents cells were split when they reached 90% confluence by washing with DPBS (Life Technologies, cat no. 14190-250), trypsinizing using TryPLE (Fisher Scientific, cat no. 12-604- 039) and split at either 1:4 (MCF7) or 1:10 (A549, NIH3T3 and HEK293T). 2.6.2 Compound Preparation Dexamethasone was purchased from Sigma-Aldrich and resuspended in molecular biology grade ethanol (Fisher Scientific). BMS-345541 (S8044), Vorinostat (S1047), and Nutlin-3a (S8059) were acquired from Selleck Chemicals and resuspended in DMSO (VWR Scientific, 97063-136). Cherry-picked 96-well compound screens were acquired from Selleck Chemicals resuspended to 10 mM in DMSO (Table S2). Compounds were diluted in their respective vehicle to 1000x of their desired treatment concentration and stored at -80C until use. 2.6.3 Drug treatment For 96-well experiments, adherent cells were trypsinized, washed with PBS and plated in tissue culture treated 96 well flat bottom plates T hermoFisherScienti f ic,catno.12−656−66 at 25,000 cells per well in 100 µL of media. Suspension cells were washed with PBS and plated in 96 well V-bottom tissue culture plates (Thermo Fisher Scientific, cat no. 549935) at 25,000 cells per well in 100 µL of media. Cells were allowed to recover for 24 hours before treatment with 1 µL of a 1:10 dilution of the appropriate compound or vehicle in PBS to maintain a vehicle concentration of 0.1% for all wells. Cells were then exposed to small molecules at the specified concentration for either 24 or 72 hours. For experiments where cells were co-treated with HDAC inhibitors and either acetate, pyruvate, citrate, ACSS2 inhibitor, ACLY inhibitor or PDH inhibitor, cells were treated 24 hours after plating and harvested after 24 hours. In this set of experiments, all wells contained a final concentration of 0.2% DMSO to match treatment with both the HDAC inhibitor and inhibitors of metabolic processes. 27 2.6.4 CellTiter Glo A549, MCF7 and K562 cells were seeded in 96 well plates, allowed to attach for 24 hours and treated with BMS345541, dexamethasone, nutlin-3A, SAHA, as described above. 24 hours post treatment, plates were allowed to reach room temperature and viability estimated using the CellTiter- Glo viability assay (Promega) according to manufacturer’s instructions. Luminescence was recorded using a BioTek synergy plate reader. For each drug treatment luminescence readings were normal- ized to the average luminescence intensities of vehicle DMSO treated wells. 2.6.5 Cell counts of bosutinib exposed cells A549, MCF7 and K562 cells were seeded in 12 well plates at 2.8 x 105 cells per well. After 24 hours to allow for A549 and MCF7 attachment, cells were exposed for 24 hours to 0.1, 1 and 10 µM bosutinib or DMSO vehicle control. After treatment, adherent cells were detached using TrypLE or directly resuspended in 1 mL of media and cells counted on a Countess II FL automated cell counter (ThermoFisher). 2.6.6 Cancer cell line encyclopedia and connectivity map data and analysis Pharmacological profiling data was downloaded from the Cancer cell line encyclopedia (CCLE) data portal (https://portals.broadinstitute.org/ccle/data). Data was the isolated and plotted for cell line of haematopoietic and lymphoid, lung and breast tissue origin exposed to the Abl inhibitors AZD0530 and nilotinib. Connectivity map (CMAP) data was downloaded from the CLUE com- mand app in the CMAP data portal. Top connections and connectivity scores were exported be- tween the MEK inhibitor perturbagen class and HSP inhibitor perturbagen class across all cell lines (Summary) or individual cell lines that overlap with our study (A549 and MCF7). Results were then filtered for data from inhibitor exposure. To determine how connectivities change across all vs. individual cell lines, we filtered for the top connections that overlap with the connectivity sum- mary in data from individual cell lines. Connectivity scores were subjected to a threshold value of 90 as in the associated CMAP study (49). 28 2.6.7 Flow cytometry A549 and MCF7 cells were seeded in 6 cm dishes at 1.6 x 106 cells per plate. K562 cells were seeded in T25 cm2 flasks at 1.6 x 106 cells per flask. After 24 hours to allow for A549 and MCF7 attachment cells were exposed for 24 hours to 10 µM abexinostat, 10 µM pracinostat or DMSO as a vehicle control. After treatment cells were harvested as described above, pellets washed twice in PBS, resuspended in 500 µL of cold PBS and fixed by the addition of 5 mL of ice-cold ethanol while vortexing at low speed. Cells were stored at -20C prior to processing for flow cytometry analysis. For flow cytometry, ethanol was removed and fixed cells washed twice with PBS con- taining 1% BSA (PBS-B) and blocked for 1 hour at room temperature. Then, blocking buffer was removed and cells were incubated in PBS containing 1% BSA and 0.1% tryton X-100 (PBS-BT) as well as a 1:500 dilution of mouse anti-acetyl-lysine antibody (cat no. ICP0390, ImmuneChem Pharmaceuticals Inc) for 2 hours at room temperature. After incubation, cells were washed twice with PBS-BT and incubated with goat anti-mouse Alexa-647 in PBS-BT for 1 hour at room tem- perature. Lastly, cells were washed twice with PBS-BT, once with PBS-B and resuspended in PBS-B containing 5 µg/ml Hoechst 33258 (Life Sciences Technologies) to stain the DNA. Then the levels of total acetylated-lysine and DNA content was analyzed by flow cytometry on an LSRII flow cytometer (BD Biosciences). Quantification and downstream analysis was performed using FlowJo10 (FlowJo.LLC). 2.6.8 Cell harvest, nuclei isolation and sample hashing For the harvest of adherent cells, media was removed, and cells were rinsed with 100 µL of DPBS and tryspinized with 50 µL of Tryp-LE for 15 minutes at 37C. Once cells had detached from the culture plate, the reaction was quenched with 150 µL of ice-cold DMEM containing 10% FBS. Cell suspensions were generated by pipetting and the entire volume was transferred to a 96 well V-bottom plate. Cells were then pelleted by centrifugation at 300 x g for 6 minutes, washed with 100 µL of ice-cold DPBS and re-pelleted at 300 x g for 6 minutes. Lysis was conducted in the 96 well V-bottom plate. Following removal of PBS, cell suspensions 29 were lysed and labeled with 50 µL of cold lysis buffer (10 mM Tris-HCl, pH 7.4, 10 mM NaCl, 3 mM MgCl2, 0.1% IGEPAL CA-630) (62) supplemented with 1% Superase RNA Inhibitor and 400 femtomoles of hashing oligo of the form 5’-GTCTCGTGGGCTCGGAGATGTGTATAAGAGACAG- [10bp-barcode]-BAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA-3’ where B is G, C or T (IDT). For the large compound screen, 500 femtomoles of an additional oligo was used to uniquely index each 96 well treatment plate. After lysis with 3 strokes of multichannel pipette, cells were fixed by addition of 200 µL of fixation buffer (5% Paraformaldehyde, 1.25x PBS ). Nu- clei were then fixed on ice for 15 minutes before pooling into a trough. Nuclei were pooled by plate into a 50 mL conical tube and pelleted by centrifugation at 500 x g for 5 minutes. Subsequently, cells were resuspended in 500 µL of nuclei suspension buffer (NSB; (10 mM Tris-HCl, pH 7.4, 10 mM NaCl, 3 mM MgCl2, 1% Superase RNA Inhibitor, 1% 0.2mg/mL Ultrapure BSA)). Finally, nuclei from all plates were pooled into a single conical tube and nuclei were pelleted by centrifu- gation at 500 x g for 5 minutes. Nuclei were then resuspended in 1mL of NSB and flash frozen into liquid nitrogen in 100 µL aliquots. Nuclei were then stored at -80C until further processing with sci-RNA-seq. 2.6.9 Preparation of sci-RNA-seq2 libraries Frozen nuclei were thawed over ice and spun down at 500g for 5 minutes. Cells were then per- meabilized in permeabilization buffer (NSB + 0.25% Triton-X) for 3 minutes and then spun down. Following another a wash in NSB, two-level sci-RNA-seq libraries prepared as previously de- scribed (63). Briefly, nuclei were pelleted at 500 x g for 5 minutes, and resuspended in 100 µL of NSB. Cell counts were obtained by staining nuclei with 0.4 % trypan blue (Sigma-Aldrich) and counted using a hemocytometer. 5000 nuclei in 2 µL of NSB and 0.25 µL of 10 mM dNTP mix (Thermo Fisher Scientific, cat no. R0193) were then distributed onto a skirted twin.tec 96 well LoBind plate (Fisher Scientific, cat no. 0030129512) after which 1 µL of uniquely indexed oligo- dT (25 µM)(63) was added to every well, incubated at 55C for 5 minutes and placed on ice. 1.75 µL of reverse transcription mix (1µL of Superscript IV first-strand buffer, 0.25 µL of 100 mM 30 DTT, 0.25 µL of Superscript IV and 0.25 µL of RNAseOUT recombinant ribonuclease inhibitor) was then added to every well and plates incubated at 55C for 10 minutes and placed on ice. 5 µL of stop solution (40 mM EDTA, 1 mM spermidine and 0.5% BSA) were added to each well to stop the reaction. Wells were pooled using wide bore tips, and nuclei transferred to a flow cytometry tube through a 0.35 µm filter cap and DAPI added to a final concentration of 3 µM. Pooled nuclei were then sorted on a FACS Aria II cell sorter (BD) at 150 cells per well into 96 well LoBind plates containing 5 µL of EB buffer (Qiagen). After sorting, 0.75 µL of second strand mix (0.5 µL of mRNA second strand synthesis buffer and 0.25 µL of mRNA second strand synthesis enzyme, New England Biolabs) were added to each well, second strand synthesis performed at 16C for 150 minutes. Tagmentation was performed by addition of 5.75 µL of tagmentation mix (0.01 µL of a custom TDE1 enzyme in 5.74µL 2x Nextera TD buffer, Illumina) and plates incubated for 5 minutes at 55C. Reaction was terminated by addition of 12 µL of DNA binding buffer (Zymo) and incubated for 5 minutes at room temperature. 36 µL of Ampure XP beads were added to every well, DNA purified using the standard Ampure XP protocol (Beckman Coulter) eluting with 17 µL of EB buffer and DNA transferred to a new 96 well LoBind plate. For PCR, 2 µL of indexed P5, 2 µL of indexed P7 (63) and 20 µL of NEBNext High-Fidelity master mix (New England Biolabs) were added to each well and PCR performed as follows: 75C for 3 minutes, 98C for 30 seconds and 18 cycles of 98C for 10 seconds, 66C for 30 seconds and 72C for 1 minute followed by a final extension at 72C for 5 minutes. After PCR, all wells were pooled, concentrated using a DNA clean and concentrator kit (Zymo) and purified via a 0.8X Ampure XP cleanup. Final library concentra- tions were determined by Qubit (Invitrogen), libraries visualized using a TapeStation D1000 DNA Screen tape (Agilent) and libraries sequenced on a Nextseq 500 (Illumina) using a high output 75 cycle kit (Read 1: 18 cycles, Read 2: 52 cycles, Index 1: 10 cycles and Index 2: 10 cycles). 2.6.10 Preparation of sci-RNA-seq3 libraries Frozen nuclei were thawed as before and three-level sci-RNA-seq libraries prepared as described in Cao et al. (59). Nuclei were pelleted at 500 x g for 5 minutes, washed three times with NSB 31 and a small aliquot of nuclei stained with 0.4% trypan blue (Sigma-Aldrich) and nuclei counted using a hemocytometer. 80000 nuclei in 22 µL of NSB, 2 µL of 10 mM dNTP mix and were then distributed into a skirted 2 µL of ligation compatible indexed oligo-dT primers were distributed into each well of 96 well LoBind plates, incubated at 55C for 5 minutes and placed on ice. 14 µL of reverse transcription mix (8µL of Superscript IV first-strand buffer, 2 µL of 100 mM DTT, 2 µL of Superscript IV and 2 µL of RNAseOUT recombinant ribonuclease inhibitor) was then added to every well and RT performed on a thermocycler using the following program: 4C for 2 minutes, 10C for 2 minutes, 20C for 2 minutes, 30C for 2 minutes, 40C for 2 minutes, 50 for 2 minutes and 55C for 15 minutes. After RT, 60 µL of nuclei buffer containing BSA (NBB, 10 mM Tris-HCl, pH 7.4, 10 mM NaCl, 3 mM MgCl2 and 1% BSA) were added to each well, nuclei pooled using a wide bore tip, nuclei pelleted by centrifugation at 500 x g for 10 minutes and the supernatant removed. A second round of combinatorial indexing was performed by ligation of indexed primers onto the 5’ end of RT indexed cDNA. Nuclei were resuspended in NSB and 10 µL added to each well of 96 well LoBind plates after which 8 µL of indexed ligation primers were added to each well along with 22 µL of ligation mix (20 µL of Quick ligase buffer and 2 µL of Quick ligase, New England Biolabs). Ligation was then performed at 25C for 10 minutes. After ligation, 60 µL of NBB were added to each well, nuclei pooled using a wide bore tip, another 40 mL of NBB added to the nuclei and nuclei pelleted by centrifugation at 600 x g for 10 minutes and the supernatant removed. Nuclei were then washed once with 5 mL of NBB, resuspended in 4 mL of NBB, multiplets removed by filtering using a 40 µm Flowmi cell strainer (Sigma-Aldrich), nuclei counted and 5000 nuclei were distributed per well into 96 well LoBind plates in a 5 µL volume. Plates containing nuclei were frozen and stored at -80C until further processing. After thawing the frozen plate 5 µL of second strand synthesis mix (3 µL of elution buffer, 1.33 µL mRNA second strand synthesis buffer and 0.66 µL of mRNA second strand synthesis enzyme) were added to each well and incubated at 16C for 3 hours. Tagmentation was performed by addition of 10 µL of tagmentation mix (0.01 µL of a custom TDE1 enzyme in 9.99µL of 2x Nextera TD buffer, Illumina) and plates incubated for 5 minutes at 55C. After tagmentation, 20 µL of DNA binding buffer was added to every well and plates incubated at room temperature for 5 minutes. 40 µL of Ampure XP beads were then 32 added to each well and plates incubated for 5 minutes at room temperature. Upon isolation of beads using a magnetic stand, supernatant was removed and beads were washed twice with 80% ethanol. 10 µL of USER reaction mix (1 µL of 10X USER buffer and 1 µL of USeR enzyme in nuclease-free water, New England Biolabs) was then added to each well and beads resuspended and incubated at 37C for 15 minutes. After incubation, 7 µL of elution buffer were added to each well and supernatant transferred to a new 96 well LoBind plate after binding beads on a magnetic stand. After incubation at 85C for 10 minutes, libraries were generated with 15 cycles of PCR. Following PCR amplification, sequencing library was purified by first concentrating 1mL of PCR library using a 1x Ampure cleanup and then running the resulting product on a 2% agarose gel containing ethidium bromide. Gel was cut to isolate 2 fragments, hash molecules (220bp - 250bp) and RNA library (250bp - 1000bp). Following gel extraction and an additional 1x Ampure cleanup RNA libraries were sequenced on a NovaSeq 6000 (Illumina) (Read 1: 34 bp, Read 2: 100 bp, Index 1: 10 bp and Index 2: 10 bp) and hash libraries were sequenced on a 75 cycle NextSeq (Read 1: 34 bp, Read 2: 38 bp, Index 1: 10 bp and Index 2: 10 bp). 2.6.11 Preparation of bulk RNA sequencing libraries Compound treated cells were first trypsinized and harvested as described previously. Cells were then lysed in V-bottom plates using 26 µL of NSB. 2 µL of 25 µM indexed RT primers were added and annealed at 65C for 5 minutes. Subsequently, RT reaction was performed using the SuperScript IV system, with 8µL of 5x SuperScript Buffer, 2µL of SuperScript IV, 2µL 10 mM dNTP mix, 2µL of 100 mM DTT and 2 µL of RNAseOUT recombinant ribonuclease inhibitor per well.. Reaction was performed for 10 minutes at 55C and subsequently stopped via heat inactivation (80C for 10 minutes). Libraries were then pooled and excess RT primer was removed through either two 0.7x SPRI clean-ups or a single 0.7x SPRI cleanup followed by Exo-1 treatment and inactivation. Double stranded DNA was produced through incubation at 16C for 3 hours with second strand synthesis mix containing 0.5 µL of enzyme and 2 µL of second strand reaction buffer in a final volume of 20 µL. Following second strand synthesis, libraries were tagmented 33 with 1 µL of commercial Nextera reagent with 20.5 µL of 2x TD buffer and . Reactions were stopped with 40 µL of Zymo Clean and Concentrate buffer and incubated at room temperature for 5 minutes. Libraries were subsequently purified with a 1x SPRI cleanup and eluted in 16 µL of elution buffer. Sequencing libraries were generated through PCR with 2 µL of index P7 and P5 primers each and 20 µL of 2x NEB Next Master Mix. Finally, libraries were pooled, purified with a 1x SPRI cleanup and quantified. Libraries were sequenced on a Nextseq 500 (Illumina) using a high output 75 cycle kit (Read 1: 18 cycles, Read 2: 52 cycles, Index 1: 10 cycles and Index 2: 10 cycles). 2.6.12 Pre-processing of sequencing data Sequencing runs were first demultiplexed using bcl2fastq v.2.18. Only barcodes that matched reverse transcription indices within an edit distance of 2 bp were retained. For sci-RNA-seq3 libraries, barcodes which matched both provided reverse transcription indices and ligation indices within an edit distance of 2 bp were retained. Following assignment of indices, polyA tails were trimmed using trim-galore, and reads were mapped to a human transcriptome (hg-38) or human- mouse transcriptome (hg-38 and mm-10) using the STAR aligner. Following alignment, reads were filtered for alignment quality, and duplicates were removed. Reads were considered duplicates if they (1) mapped to the same gene, (2) mapped to the same cell barcode and (3) contained the same unique molecular identifier (UMI). Reads that met the first two criteria, and differed by an edit distance of 1 from a previously observed UMI were also marked as duplicates and discarded. Non-duplicate reads were assigned to genes using bedtools (78) to intersect with an annotated gene model. All 3’ UTRs in the gene model were extended by 100 bp to account for the possibility that some gene 3’ UTR annotations may be too short, causing genic reads to improperly be annotated as intergenic. Cell barcodes were considered to correspond to a bona fide cell if the number of unique reads associated with the barcode was greater than an interactively defined threshold on a knee plot. Reads from cells that passed this UMI count threshold were first aggregated into a sparse matrix format and then loaded and saved as a CDS object for analysis with Monocle3. 34 2.6.13 Assigning sample labels from hash reads Demultiplexed reads that matched combinatorial indexing barcodes were examined to identify hash reads. Reads were considered hash reads when they met two criteria: (1) the first 10 bp of read 2 matched a hash barcode in the experiment within an edit distance of two and (2) contained a polyA track between base pairs 12 to 16 of read 2. These reads were then deduplicated by cell barcode and collapsed by UMIs to create a vector of hash oligo UMI counts for each nucleus in the experiment. To assign each nucleus to the culture well from which it came, we test whether its sci-RNA-seq library is enriched for a particular hash barcode. We compare a nucleus’s hash UMIs against a ‘background distribution’, which under ideal circumstances, would be the uniform distribution. In practice, minor variation in concentrations of hash oligos added to each well of liberated nuclei may necessitate empirically estimating the background. To do so, we simply average the relative hash UMIs from cell indices for which fewer than < mRNA UMIs were collected, reasoning that these reflect library contributions from RT well supernatant, debris fragments, etc. We then compare the hash UMIs for nucleus to this background by a chi-squared test. After correcting the resulting p values for multiple testing by Benjamini-Hochberg, we reject the null hypothesis that originates from the background distribution at specified FDR (5% FDR was used in this study). Those nuclei with hash counts deemed different than background are then evaluated for enrichment for a single hash sequence. Enrichment ratios were calculated as the UMI count ratio of the most abundant vs. the second most abundant hash oligo. Specifically, if the UMI count for the most abundant hash in nucleus is α-fold higher than the second most abundant, is marked as a singleton. α was determined on a per-experiment basis by examining the distribution of these ratios and choosing a value that separated unlabeled cells and singularly labeled cells. Cells that fell below α-fold enrichment of a unique hash oligo were flagged as a multiplet or debris and discarded. 35 2.6.14 Dose-response analysis Dose-response analysis was conducted in R using the drc package (79) by fitting a four-parameter log-logistic model for each drug to the number of cells recovered in the single-cell RNA-seq data at each dose. Cells that survived doublet analysis and QC were grouped by their culture well of origin and counted. These counts were then adjusted to account for variation in recovery as a function of cell type and culture plate as follows. The vector of cell counts across wells were fit with the model The dose response curves enable cells to be annotated according to the impact of their culture conditions on viability. Each cell is assigned a ‘viability score’ which is simply the expected fraction of vehicle cells remaining after exposure to a given dose of a compound. These cell counts are generated via the predict() function of the drc package and then normalized relative to the corresponding vehicle control. 2.6.15 Dimensionality reduction and trajectory analysis Gene expression profiles were visualized with Monocle 3, which uses UMAP to project them into a two or three dimensional space. Briefly, Monocle 3 first calculates size factors for every cell. Size factors were calculated as the log UMI counts observed in a single cell divided by the geometric mean of log UMI counts from all measured cells. After scaling each nucleus’ UMI counts by its library size factor, Monocle3 adds a pseudocount of 1, and log transforms the counts. Next, these log-transformed profiles are projected onto the top 25 principal components. These PCA coordinates were transformed by Monocle 3 (using an approach similar to the removeBatchEffect() function in the limma package (80)) according to the model l˜og(UMIs) + replicate (Figure 3) or ‘ log(UMIs) + viability + proliferation index + replicate’ . Adjusted PCA coordinates for each cell are used to initialize UMAP. Unless otherwise noted, UMAP was run with the following parameters: 50 nearest neighbors, min dist = 0.1, inter-cell distance assessed by cosine similarity. UMAP projection of cells after dual HDAC inhibition and acetyl-COA precursor supplementation or acetyl-CoA generating enzyme inhibition was performed as described with the exception that 36 PCA initialization was performed on the top 1000 most overdispersed genes. Louvain community detection was then performed on this UMAP space using the python package ‘louvain’. Trajectory reconstruction was then performed as described in (59). To determine whether cells exposed to a particular compound/dose combination displayed an enrichment along UMAP space we created contingency tables of the number of compound or vehicle treated cells within and outside clusters and used the stats R package implementation of Fisher’s exact test to test for enrichment. For visualization of drug enrichment in Figure 3B, cells opacity was added to cells under the minimum compound/dose that passed meet an enrichment cutoff of FDR < 1% and a log2 of the odds ratio ¿ 2.5. Cells that passed these filters were used to generate the heatmap of the fraction of enriched cells by cluster in Figure S6. 2.6.16 Estimation of Proliferation Index To obtain an estimate of proliferation index for a single cell, size factor normalized expression of cell cycle marker genes (from Table S5 in Tirosh et. al. Science, 2016) were summed for each cell and logged. Scores were calculated in this way for both G1S and G2M. “Proliferation Index” refers to overall proliferative state of a cell and is calculated as the logged sum of the aggregated G1S and G2M gene expression. 2.6.17 Differential expression analysis To test whether a gene is differentially expressed by a cell line in a dose-dependent manner when exposed to a compound, we fit its (library size-factor adjusted) UMI count recorded from each nucleus with a generalized linear model: Where is a quasipoisson-valued random variable, is the log-transformed dose of the compound being evaluated. We fit these models with Monocle 3, which uses the speedglm package. To fit the regression model for each drug’s effect on each gene, we first identify the subset of cells that are relevant for the model. To determine the effects on gene G in cells of type C when treated with drug D, we include all cells of type C that were treated with any dose of D. To these, we 37 add cells of type C that were treated with the vehicle control. We then fit a model defined above relating the expression level of G across all of these cells. Genes are deemed to be dose-dependent differentially expressed genes (DEGs) if their fitted models include a term that is significantly different from zero as assessed by a Wald test (Benjamini-Hochberg adjusted ). P values for terms are pooled across all compounds and all genes prior to correction for multiple testing. To assess a gene for differential expression as a function of ‘pseudodose’ in the consensus HDAC inhibition trajectory, we fit a model Where is a quasipoisson variable capturing the gene’s UMI counts, encodes the pseudodose values smoothed via a natural spline, is a factor encoding the cell type, and captures the interaction between cell type and pseudodose. The term encodes the (log) dose dependent effects of compound . 2.6.18 Pairwise correlation of screened compounds To identify compounds that result in similar dose-dependent changes to cellular transcriptomes we calculated the Pearson correlation between every pairwise set of compounds. We created a gene by compound matrix for the union of dose-dependent genes across all compounds where each entry is the beta coefficient for the dose dependence term and ten calculated the Pearson correlation for every drug pair using the cor.test() function in the R stats package specifying to use complete observations. The resulting correlation matrix was then hierarchically clustered using the pheatmap package in R. The significance of every pairwise correlation was determined using the corr.test() function from the psych package in R specifying Benjamini-Hochberg as the method for adjusting for multiple hypothesis testing. 2.6.19 Geneset enrichment analysis After fitting a generalized linear model, genes that had significant coefficients (5% FDR threshold) were used for gene set enrichment analysis with the R package piano (81). Briefly, gene sets were ranked according to the set-wide average Wald test statistic corresponding to the generalized linear 38 model term being evaluated with piano’s runGSA() function. Genes were randomized across sets to establish a null distribution for each set’s rank. After 10000 permutations, runGSA() computed p values using the ’mixed’ directional enrichment policy.. The top gene sets, corresponding to those with the largest magnitude enrichment statistic, were chosen for visualization. 2.6.20 Alignment of HDAC inhibitior treated cells To organize cells treated with HDAC inhibitors into a trajectory cells were sampled to equalize the number of cells represented between the three cell lines or between treatments at 24 and 72 hrs. Next, PCA coordinates were computed jointly, and then aligned using the mnnCorrect function from the package scran (70). These adjusted coordinates were used to initialize UMAP in Monocle 3. We then fit a principal graph to the data via lean graph(). To define the origin of the trajectory, we mapped each cell to its nearest principal graph node, and then selected all principal graph nodes for which a majority of mapped cells were treated with vehicle. All other cells’ pseudodoses was measured as the geodesic distance between their nearest principal graph node to an origin node. To quantify the potency of each HDAC inhibitor, we first grouped all cells from each replicate according to treatment and dose, and then computed the mean pseudodose for each cell. We then fit mean pseudodose values as a function of compound concentration using the drc package (79). We used a four-parameter log-logistic model, with the maximal response fixed at the highest pseudodose value achieved across all compounds and doses. We then take the model parameter as described in the ’dose response analysis’ section above as the transcriptional EC50 (TC50) for each compound. 2.7 SUPPLEMENTAL FIGURES 39 Figure 2.6: Hashing with short, polyadenylated single-stranded oligonucleotides enables stable, low- cost labeling of nuclei for sci-RNA-seq and subsequent doublet detection (A) Fluorescent microscopy images demonstrating lack of Alexa 647-conjugated oligo staining (right) of unpermabilized H3-GFP NIH3T3 cells (left). (B) Design of polyadenylated hash oligos (top) and indexed primer used for reverse transcription (bottom). (C) Number of hash UMIs detected per cell. Cells with fewer than 10 hash UMIs (red line) were excluded from further analysis. (D) Distribution of enrichment ratios for cells. Enrichment ratios were calculated as the UMI count ratio of the most abundant vs. the second most abundant hash oligo. An enrichment ratio cutoff of 15 (red line) was used to distinguish doublets vs. singlets. (E) Boxplot of the number of cells recovered per well for each cell line. (F) Layout of culture plate wells with color indicat- ing number of cells recovered and outline indicating cell line. Note that although more NIH3T3 cells were recovered per well, similar numbers of cells were recovered across wells of each cell type. (G) Log-scale per-gene aggregated, size-factor normalized UMI counts recovered from sci-RNA-seq on fresh vs. frozen preparations. Size factors are calculated as the log counts observed in a single cell divided by the geometric mean of log counts from all measured cells. Black line indicates y = x. Red line is the fit with Pearson cor- relation shown. (H) Log-scale boxplot of number of hash UMIs recovered from sci-RNA-seq of HEK293T (human) or NIH3T3 (mouse cells) from fresh vs. frozen preparations. (I) Theoretical (red bars) vs. ob- served (black dots for individual wells and blue bars for means) doublet rate as a function of the number of nuclei sorted into the final plate during sci-RNA-seq. (J) Barnyard plot from Figure 1E after removal of doublets detected by hashing. (K) Log-scale boxplot of number of RNA UMIs in singlet vs. doublet cells, as called based on the purity of hash UMIs. Of note, these are ‘within species’ doublets, i.e. human-human or mouse-mouse, which are not readily detected by conventional barnyard experiments. 40 Figure 2.7: sci-Plex distinguishes transcriptional responses of A549 cells to four small molecules and recovers dose-response estimates similar to established assays A) Experimental layout of A549 cells in 96 well plates. Cells were treated for 24 hours in two 96 well plates using 7 doses (or vehicle) arrayed along each column. B) Cells that contained more than 30 hash oligo UMIs and C) had an enrichment ratio of greater than 10 were retained. D) Retained cells had a median hash UMI count of 78 and median RNA UMI count of 4,681. E) UMAP embedding of chemically perturbed A549 cells, equivalent to Figure 2B but with cells colored by whether they were treated with vehicle or one of the four small molecules. F) UMAP embedding of chemically perturbed A549 cells, equivalent to Figure 2B but with cells colored by cluster as defined using the density peak algorithm in Monocle 3. G) Cartoon depicting how pooling of barcoded nuclei preserves relative cell counts. H) Viability estimates from counting the proportion of recovered hashed nuclei (grey) vs. CellTiter-Glo (red, n = 6). I) Scatter plot of inferred cell counts (x-axis) and CellTiter-Glo viability estimates (y-axis) across all treatments and doses tested (Pearson correlation and chi square test). 41 Figure 2.8: Dose-dependent differentially expressed genes (DEG) recover expected transcriptional modules. A) Upset plot displaying the intersections of dose-dependent DEGs between treatments (ver- tical bars) as well as the total number of dose-dependent DEGs per treatment (horizontal bars). A gene is defined as a dose-dependent DEG if the quasi-poisson regression model relating its expression in a given cell to the dose of drug that cell received shows a significant dose effect (Wald test) after Benjamini-Hochberg correction (FDR < 0.05). See Methods for full details on regression modeling. The four leftmost verti- cal bars correspond to drug-specific dose-dependent DEGs, while the rightmost vertical bar corresponds to dose-dependent DEGs shared by all four drugs. B) Gene set analysis (GSA) performed with dose-dependent DEGs using the runGSA() function from the piano package and the Hallmarks gene set from MSigDB (82). Heatmap color indicates the value of the directional GSA enrichment statistic with values that were capped at either -10 or +10 for visualization. 42 Figure 2.9: Hash-based cell labeling in large-scale sci-Plex experiment. A) Hashing design for sci- Plex with 188 compounds. The experiment used 52 x 96-well plates where each well was marked by a combination of two oligos, one specific to a single 96-well culture plate and another specific to a well within that culture plate. B) Although this could theoretically be implemented with just 96 well hash oligos, we instead used 768, which meant that out of the 39,936 possible pairings of plate and well hash oligos, only a minority (12.5%) of combinations were expected (’legal’), while most were unexpected (’illegal’) C) Observed pairings of plate and well hash oligos were strongly enriched for ’legal’ combinations. D) Scatter plot of HEK293T and NIH3T3 cells seeded in a single RT well of the large-scale sci-Plex experiment. E-H) Hash UMI (panels E,G) and enrichment ratio (panels F,H) cutoffs used for well hash oligos (panels E,F) and plate hash oligos (panels G,H). Enrichment ratio cutoffs corresponds to greater than 5-fold enrichment. Hash UMI cutoffs correspond to 5. 43 Figure 2.10: Quality control metrics for large-scale sci-Plex experiment. A) Log-scale boxplot of num- ber of RNA UMIs for cells that passed hash and RNA UMI cutoff filters for each of three cell lines. B) Correlation of size factor-normalized counts for genes between replicates for each of the three cell lines. Black line indicates y = x. Red line is the fit with Pearson correlation shown. C) Boxplots showing the number of vehicle cells recovered from each of 8 vehicle control wells within each replicate for A549, K562 and MCF7 cells. 44 Figure 2.11: Exposing cells to compounds alters their distribution across cell clusters. Heatmap show- ing the log-transformed ratio of cells treated with a particular drug compared to vehicle control cells in each Louvain community. Columns correspond to clusters in PCA space (see (Figure 3.12A-C) and rows correspond to compounds, annotated by pathway and target. A gray entry denotes a compound that is not significantly enriched or depleted relative to vehicle in the corresponding cluster (Fisher’s exact test, FDR < 1%). 45 Figure 2.12: Exposing cells to compounds alters their distribution across cell clusters. Heatmap show- ing the log-transformed ratio of cells treated with a particular drug compared to vehicle control cells in each Louvain community. Columns correspond to clusters in PCA space (see (Figure 3.12A-C) and rows correspond to compounds, annotated by pathway and target. A gray entry denotes a compound that is not significantly enriched or depleted relative to vehicle in the corresponding cluster (Fisher’s exact test, FDR < 1%). 46 Figure 2.13: sci-Plex identifies pathway-specific enrichment of compounds across UMAP clusters. A- C) UMAP embedding from Figure 3B colored by cells’ assignment to Louvain communities across PCA space for A549 (panel A), K562 (panel B) and MCF7 (panel C) cells. D) UMAP embedding of A549 cells from Figure 3B. Cells treated with the glucocorticoid receptor (GR) agonist triamcinolone acetonide are highlighted in green while all other cells are colored grey. These cells comprise the vast majority (95%) of the cells in cluster 18 from panel A. E) Percent of A549 cells expressing the GR target genes ANGPTL4 and GDF15, as a function of increasing doses of the synthetic GR agonist triamcinolone acetonide. F-H) UMAP embedding of A549 cells colored by cells treated with varying doses of Epothilone A (F), Epothilone B (G), or colored by proliferation index (H). Insets display magnified views of distinct foci induced upon treatment. The treatments with the highest number of cells in each bounding box are indicated in panel H with the number of cells in parentheses. 47 Figure 2.14: Number of dose-dependent differentially expressed genes detected per compound cate- gory. Significant dose-dependent differentially expressed genes (FDR < 0.05) are grouped by cell line and colored by targeted pathway. 48 Figure 2.15: Correlation of “pseudobulk” sci-Plex with bulk-RNA-seq. A) Log10 transcripts per million (TPM) for protein-coding genes measured by bulk RNA-seq (x-axes) vs. size factor-normalized, aggregated single cell profiles for vehicle treated cells from sci-Plex (y-axis). Results are shown for both A549 and K562 cells. Black line indicates the line y = x, while the blue line shows the linear fit with Pearson correlation shown. B) Scatter plots, for selected compounds, comparing statistically significant estimates derived from linear models fit to single cell data (x-axes) vs. estimates derived from bulk RNA-seq using DESeq2 (y- axes). Black line indicates y = x. Blue line is the fit with Pearson correlation shown. 49 Figure 2.16: Moderated Z scores from the L1000 assay correlate with dose-dependent betas from sci- Plex A) For a selected compound-cell line combination (Trichostatin A in MCF7 cells), we plot moderated Z scores from the L1000 assay with treatment for 24 hrs at each of eight doses (y-axes) (49) vs. dose- dependent betas from sci-Plex data (x-axes). All genes that are part of the L1000 assay and significant for dose-dependent effects with sci-Plex (p-value < 0.01) are shown. Line is the fit with Spearman correlation shown. B) Boxplot of Spearman correlations between significant sci-Plex computed dose-dependent betas and L1000 moderated Z-score values from LINCS L1000 data for measured genes at the highest dose in MCF7 cells. Compounds are presented as grouped by the pathway they target. Red point corresponds to Fluvestrant. C) Similar to panel A, but for Fluvestrant in MCF7 cells and at the highest dose (10 µM). D) Similar to panel B, but for A549 cells. Red point corresponds to Triamcinalone Acetonide. E) Similar to panel A, but for Triamcinalone Acetonide in A549 cells and at the highest dose (10 µM). 50 Figure 2.17: Single cell measurements reveal variation in proliferation status in vehicle treated cell and across each dose of each drug.A-C) UMAP projection of A549 (A), K562 (B) and MCF7 (C) colored by proliferation index. High proliferation index indicates an increase in the aggregate expression of tran- scripts that are markers for G1/S phase or G2/M phase (83). (D-F) Density plot of cell cycle distribution for compound-treated cells (blue fill) or vehicle-treated cells (red line). Grey line indicates cutoff used to distin- guish proliferating cells (greater than cutoff) vs. non-proliferating cells (less than cutoff). G-I) Relationship between the percentage of cells designated as low proliferation at each dose of each drug (x-axis) versus the median estimated viability of that combination (y-axis). Each black point corresponds to cells treated with the same dose of a given drug. Red points correspond to vehicle treatment. J) Volcano plot depicting the log2 fold change for significant (q value < 0.01) differentially expressed genes between high and low fractions of vehicle treated cells. 51 Figure 2.18: Single cell measurements enable estimation of proliferation status and viability across drug-dose combinations. Heatmap depicting estimates of relative proliferation rate, the percentage of cells exhibiting low proliferation index, and the estimated viability for each compound (row) at each dose (column) pair. 52 Figure 2.19: sci-Plex enables the dissection of proliferating and non-proliferating cell populations. A) Schematic depicting how changes in cellular state (top) and changes in the relative frequency of subpop- ulations (bottom) look identical upon subjecting the sample to aggregate measures such as bulk RNA-seq. Adapted from ref (52). B,C) Pearson correlations between dose-dependent effect sizes estimated from high vs. low proliferation index cells for each cell line (panel B) and drug class (panel C). D) Per-gene effect sizes estimated from high βdh vs. low βdl proliferation index cells for 4 selected compounds. Effect sizes are expressed as log2 transformed fold changes over intercept. Four classes of genes are shown: those sig- nificant in only high proliferation index cells (green); only low proliferation index cells (purple); both high and low cells, and with concordant effect estimates (red); both high and low cells, but with discordant effect estimates (blue). A drug had concordant dose-dependent effects on gene 〈 in high cells βdh and low cells βdl when |βdh−βdl| was less than 10 percent of 12 |βdh−βdl|. Black line indicates y = x. 53 Figure 2.20: sci-Plex screen identifies viability and expression signatures that are reproducible across validation experiments and orthogonal datasets. A) Cell count viability estimates for K562 (red), A549 (blue) and MCF7 (green) cells exposed to vehicle or increasing doses of the Src/Abl inhibitor bosutinib (n = 6 culture replicates, Wilcoxon rank sum test). For each cell line, cell count values were normalized to the mean cell counts value of vehicle control treated cells. Error bars denote standard error of the mean, n = 8. B) EC50 values for cell lines of haematopoietic and lymphoid, lung and breast tissue origin, for which viability estimates are available from the Cancer Cell Line Encyclopedia (CCLE), exposed to the Abl inhibitors AZD0530 (left panel) or nilotinib (right panel). C-E) Top connectivity scores (a measure that summarizes similarities between transcriptional signatures induced by different drugs (49, 50)) for MEK and HSP inhibitors from the CMAP database across all cell lines (C) or for A549 (D) and MCF7 (E) cells individually. A connectivity score cutoff of +/- 90 was applied as in (49) 54 Figure 2.21: Correlation of compound-driven molecular signatures for A549 cells identified in sci-Plex screen. Heatmap depicts the Pearson correlation of beta coefficients across dose-dependent differentially expressed genes for every pairwise combination of compounds screened. To aid in visualization Pearson correlations were capped at 0.6. 55 Figure 2.22: Correlation of compound-driven molecular signatures for K562 cells identified in sci-Plex screen. Heatmap depicts the Pearson correlation of beta coefficients across dose-dependent differentially expressed genes for every pairwise combination of compounds screened. To aid in visualization Pearson correlations were capped at 0.6. 56 Figure 2.23: Correlation of compound-driven molecular signatures for MCF7 cells identified in sci- Plex screen. Heatmap depicts the Pearson correlation of beta coefficients across dose-dependent differ- entially expressed genes for every pairwise combination of compounds screened. To aid in visualization Pearson correlations were capped at 0.6. 57 Figure 2.24: Clustergrams of the correlation of compound-driven molecular signatures. Clustergrams depicting the Pearson correlation of beta-coefficients across dose-dependent differentially expressed genes for every pairwise combination of compounds screened for A549 (A), K562 (B) and MCF7 (C) cells. Com- pounds names are colored by the pathway targeted. 58 Figure 2.25: Pairwise distances between PCA embeddings of drugs based on their dose-dependent effects. A) Heatmap of pairwise distances between two cell types (columns) for a given drug (rows) in PCA reduced dimensional space. Hierarchically clustered to visualize cell type-specific responses to each drug. B) Insets of highlighted portions of the heatmap with pathway annotation shown to the left. Specific compounds highlighted with a red arrow are shown to the right (C-E) as UMAP embeddings. F) Trametinib treated cell lines are highlighted to illustrate colocalization of A549 and K562. Colored points correspond to labeled compound and all other drugs are shown in gray. Shape encodes the cell line from which each effect profile was captured (squares: MCF7; triangles: K562; circle: A549). 59 Figure 2.26: HDAC inhibitor-treated cell types align and enable joint pseudodose trajectory recon- struction.A) UMAP embedding highlighting the reconstructed pseudodose trajectory over the mutual near- est neighbor-aligned HDAC inhibitor and vehicle treated cells. Root nodes (red points) were chosen as nodes in the principal graph that had over 50% of their nearest neighbors annotated as vehicle treated cells. B) Dis- tribution of each cell line within the embedding. C) Proportion within each pseudodose bin corresponding to each cell line. 60 Figure 2.27: Linear models identify pseudodose-dependent modules of proliferation and metabolism. A) Barplot of the total number of significant dose-dependent and pseudodose-dependent DEGs (FDR < 0.05). B) Upset plot displaying the intersections of significant pseudodose-dependent DEGs between the three cell types. C) Pseudodose heatmap depicting 4,308 genes that varied significantly as a function of pseudodose. Each row corresponds to the expected expression for a gene in the three cell lines as fit by the model described in the ’Differential expression analysis’ section of the Methods. Genes (rows) were scaled and standardized within each cell line before joining the three matrices and performing hierarchical cluster- ing. Clusters from hierarchical clustering were then used as an input into GSAhyper using the Hallmarks geneset. Select genes and genesets characterizing each cluster are shown (right). 61 Figure 2.28: HDAC inhibitor treatment induces cell cycle arrest in all three cell lines. A) Percentage of cells expressing RNA for AURKA and CDKN1A across pseudodose bins. Black bars denote the boot- strapped 95% confidence interval. B) Boxplots depicting the percentage of cells in the low proliferation fraction in at a given drug dose across pseudodose bins. C) DNA content analysis of the three cell lines upon treatment with DMSO (top) or 10µM Abexinostat (bottom). D) Quantification of flow cytometry data depicting the number of cells in each DNA content category. 62 Figure 2.29: HDAC inhibitor exposure leads to sequestration of acetate in the form of acetylated lysines. A) Quantification of flow cytometry measurements of total cellular acetylated lysines in A549 (left panel), MCF7 (middle panel) and K562 (right panel) cells exposed to 10 µM pracinostat, 10 µM p abexinostat or vehicle control. Error bars denote standard deviation of the mean (Wilcoxon rank sum test, n = 3 culture replicates, * p < 0.05, *** p < 0.005). B) Representative flow cytometry histograms for the experiment quantified in panel A. Blue shaded regions and red lines correspond to DMSO vehicle control and 10 µM abexinostat, respectively. 63 Figure 2.30: Pseudodose ridge plots. Ridge plots display the distribution of cells along pseudodose for each HDAC inhibitor and dose combination for compounds that localized to the HDAC trajectory. 64 Figure 2.31: Transcriptional trajectory of HDAC inhibitor-treated cells corresponds to in vitro IC50 measurements.A) Pseudodose response curves were fit for each compound and each cell line using the drc R package. The mean position of each dose along the pseudodose trajectory was used as the response. Two illustrative examples for Belinostat (top) and Trichostatin A (TSA) (below) are shown. Dotted vertical lines illustrate the transcriptional EC50 (TC50) for each compound in each cell line. Shaded gray area denotes the 95% confidence intervals for each TC50 estimate. B) Plot displaying aggregate in vitro measured mean of log10(IC50 [M]) versus log(TC50) colored by solubility supplied by Selleckchem Chemicals. Points displayed as (*) were not used for fits. C) log10(IC50 [M]) versus log(TC50) for each HDAC isoform. Each point is colored by the HDAC inhibitor used. 65 Figure 2.32: Supplementation with acetyl-CoA precursors decrease, while inhibition of enzymes that replenish acetyl-coA pools exacerbate, progression along the HDAC inhibitor pseudodose trajec- tory.A-D) UMAP embeddings of A549 (A,B) and MCF7 (C, D) single cell transcriptomes after exposure to the HDAC inhibitors pracinostat or abexinostat, in the presence or absence of acetyl-CoA precursors or inhibitors to enzymes that replenish acetyl-CoA pools. UMAP were constructed from cells from all con- ditions in the experiment. Cells are colored by pseudodose bin (A,C) or dose (B,D). E) Venn diagram of the overlap of differentially expressed genes across trajectories between or original HDACi trajectory vs. A549 or MCF7 HDACi trajectories from this new experiment. F,H) Boxplots of pseudodose estimates for select conditions of cells exposed to 1 or 10 µM pracinostat with or without co-treatment with acetyl-coA precursors for A549 (H) or MCF7 (L) cells. Values are normalized to vehicle treated cells. Wilcoxon rank sum test. G,I) Boxplots of pseudodose estimates for select conditions of cells exposed to vehicle and pra- cinostat with or without co-treatment with acetyl-coA precursors for A549 (I) or MCF7 (M) cells. Values were normalized to vehicle treated cells. Wilcoxon rank sum test. J,L) Heatmaps depicting the fraction of cells per pseudodose bin for cells exposed to various acetyl-coA precursors in pracinostat-exposed A549 (F) or MCF7 (J) cell. K,M) Heatmaps depicting the fraction of cells per pseudodose bin for cells exposed to various inhibitors targeting enzymes that replenish acetyl-coA pools in pracinostat-exposed A549 (G) and MCF7 (K) cells. 66 Figure 2.33: Correlation of effect sizes between differentially expressed genes post-HDAC inhibition from original screen vs. new experiment. A-B) Correlation of effect size estimates (beta coefficients) for differentially expressed genes between vehicle control and 10 µM abexinostat (A) or 10 µM pracinostat (B) for A549 cells. C-D) Correlation of effect size estimates (beta coefficients) for differentially expressed genes between vehicle control and 10 µM abexinostat (C) or 10 µM pracinostat (D) for MCF7 cells. X-axes correspond to large-scale sci-Plex experiment. Y-axes correspond to targeted followup sci-Plex experiment. 67 Figure 2.34: Contact inhibition of cell proliferation 72 hours post drug exposure. Representative bright- field images of A549 cells exposed to vehicle (A) or the specified dose of the SIRT1 activator SRT2104 (B) or the HDAC inhibitor Abexinostat (C). Viability estimates as determined by recovered cell counts for each drug/dose combination normalized to cell counts of vehicle control wells. 68 Figure 2.35: Aligning A549 cells at 24 and 72 hours after treatment reveals time-dependent responses to diverse small molecules.A-C) UMAP embedding of A549 cells at 24 and 72 hours post treatment in the absence of a correction for differences in viability and proliferation (A), after linear transformation of the data to account for changes in proliferation index and viability (B) and after mutual nearest neighbor based alignment of data after linear transformation (C). Cells are colored by the time point at which they were collected. (D-F) UMAP embeddings as in panels A-C with cells colored by the aggregated normalized expression score of G1/S marker genes. (G-I) UMAP embeddings as in panels A-C with cells colored by the aggregated normalized expression score of G2/M marker genes. (J-L) UMAP embeddings as in panels A-C with cells colored by proliferation index. (M-O) UMAP embeddings as in panels A-C only visualizing cells treated with vehicle control. (P) UMAP embeddings from panel C with cells colored as to the pathway targeted by the treatment to which they were exposed. (Q) Proportion of cells broken up by pathway targeted. Note that only a subset of our 188 compounds across a limited number of pathways were tested at 72 hours. (R) Proportion of cells broken up by the activity targeted by treatment with epigenetic regulation compounds. (S) Proportion of cells broken up by HDAC compound. 69 Figure 2.36: Bromodomain inhibition, sirtuin activation, and histone deacetylase inhibition induce characteristic transcriptomic responses.A-D) UMAP embedding of MNN aligned A549 cells 24 and 72 hours after treatment with the pan-HDAC inhibitors abexinostat (A) or belinostat (B), the bromodomain inhibitor JQ1 (C), and the SIRT1 activator SRT2104 (D). Cells are colored by the dose to which each cell was exposed. 70 Figure 2.37: The heterogeneous response to the majority of HDAC inhibitors does not appear to be driven by cellular asynchrony. A) Aligned UMAP embeddings of cells exposed to vehicle HDAC in- hibitors for 24 or 72 hours. Cells are colored by their progression along pseudodose. B) Aligned UMAP embeddings of cells exposed to vehicle (grey cells) or the labeled HDAC inhibitor for 24 (red cells) or 72 (blue cells) hours. C) Pseudodose position of cells upon treatment with the indicated HDAC inhibitor. 72 hour time point is shown in color, while 24 hours is shown in transparent gray. 71 Chapter 3: EMBRYO-SCALE, SINGLE CELL SPATIAL TRAN- SCRIPTOMICS Chapter 3 is adapted with minimal modification from: Sanjay R. Srivatsan*, Mary C. Regier*, Eliza Barkan, Jennifer M. Franks, Jonathan S. Packer, Parker Grosjean, Madeleine Duran, Sarah Saxton, Jon J Ladd, Malte Spielmann, Carlos Lois, Paul D. Lampe, Jay Shendure, Kelly R. Stevens, Cole Trapnell Science 2021(in press). I contributed to all aspects of the paper, including the conception of the project, performance of the experiments, analysis of the data, and writing the initial manuscript draft. 3.1 ABSTRACT Spatial patterns of gene expression span many scales, and are shaped by both local (e.g. cell-cell interactions) and global (e.g. tissue, organ) context. However, most in situ methods for profiling gene expression either average local contexts or are restricted to limited fields of view. Here we introduce sci-Space, a scale-flexible method that retains single cell resolution while resolving spa- tial heterogeneity in gene expression at larger scales. As a proof-of-concept, we apply sci-Space to whole mount sections of developing mouse embryos (E14.0), capturing the approximate spa- tial coordinates and whole transcriptomes of 120,000 profiled nuclei. We identify thousands of genes that are expressed in an anatomically patterned manner, including both known (e.g. Hox genes) and novel factors, and find that individual cell types vary substantially in the extent to which they exhibit spatial patterning of gene expression at this scale. In both the nervous system and mesenchyme, spatial information facilitates the annotation of cellular subtypes. Finally, in the developing mouse brain, we identify strong correlations between pseudotime and three spatially distinct migratory patterns of differentiating neurons. Looking forward, sci-Space presents a path for the construction of spatially-resolved single cell atlases of mammalian development. 72 3.2 INTRODUCTION Methods for molecular profiling at single cell resolution have the potential to transform our un- derstanding of developmental biology. For example, in diverse models of embryogenesis, we and others have performed “whole organism” profiling of transcription or chromatin accessibility at single cell resolution (84–90). Although the clear majority of major cell types have previously been described, such studies yield a much richer view of cell states during development than was previously available. For example, applying machine learning techniques to organize cells in terms of their progression through development has revealed complex branching trajectories similar to, but topologically distinct from, cell lineage histories (91, 92). The spatial organization of cells plays a central role in normal development and homeostasis, as well as in the pathophysiology of many diseases. However, a key limitation of widely used methods for single cell molecular profiling is that they operate on disaggregated cells or nuclei. Although several groups have developed powerful in situ methods for measuring the expression of many or all genes while retaining spatial information, these methods are not without limitations (Figure 3.6). A first class of methods, including the original “spatial transcriptomics” (93) and Slide-seq (94) methods, barcode and then count mRNAs derived from each spot across patterned arrays. However, although such methods can potentially be implemented at a range of spatial scales, a key limitation is that the boundaries of spots have no natural correspondence to the boundaries of cells. As such, they yield aggregate profiles of small regions encompassing multiple cells and/or portions of cells, rather than resolving individual cells. A second class of methods, including MERFISH (95), seqFISH (96), and FISSEQ (97), rely on in situ hybridization or sequencing to measure the expression of many genes while retaining single cell (or even subcellular) resolution within each field of view. However, such multiplex in situ methods are typically limited by long image acquisition times and complex instrumentation requirements. In sum, the tradeoffs between these two classes of techniques is such that assaying the whole transcriptomes of individual cells over large regions remains impractical. We sought to develop a spatial transcriptomics method that retained both single cell resolution 73 (that is, the ability to unambiguously ascertain sets of transcripts expressed in the same cell) and the flexibility to capture positional information at broader scales (e.g. to detect spatial patterns of cell type-specific gene expression that would be found at the level of a whole organ or embryo, but missed by high-resolution analysis of a small region). To this end, we hypothesized that we could label cells with molecular tags that encoded their approximate spatial coordinates within entire tissue sections, and subsequently recover this information without sacrificing single cell resolu- tion. Furthermore, sampling single cells from across a tissue could reveal spatial patterns of gene expression, including those that are cell type-specific, without comprehensively (and expensively) sequencing the whole transcriptome of every cell. 3.3 RESULTS 3.3.1 Spatial labeling of nuclei with oligonucleotide hashes We recently developed sci-Plex, a method for labeling or “hashing” nuclei using unmodified DNA oligos during single cell RNA-seq with combinatorial indexing (sci-RNA-seq) (98). sci-Plex en- ables the pooling of nuclei from many different specimens or samples into one sci-RNA-seq exper- iment at minimal marginal cost. To leverage parts of this workflow to capture spatial information, we reasoned that we could first print unique combinations of hashing oligos to a spatially patterned array, and then transfer those oligos to the nuclei of a tissue slice by diffusion (99). Provided that these hashing oligos were recovered in association with single cell RNA-seq profiles, they could be used to reconstruct each cell’s approximate tissue coordinates upon sequencing. As a proof-of-concept of this method, which we call “sci-Space”, hashing oligos were spotted onto glass slides coated with a thin layer of dried agarose in a gridded format. These spatial grids contained 7,056 uniquely barcoded spots in a 18mm by 18mm grid with a mean radius of 73.2 ± 14.1µm and a mean spot-to-spot center distance of 222 ± 7.5µm (Figure 3.7). About 5% of spots, constituting an identifiable pattern, were also loaded with SYBR green fluorescent dye. After transferring the oligos from the slide to the tissue, the grid could be registered with an image of the tissue using these concurrently imaged fluorescent fiduciary points (Figure 3.7, Figure 3.8). 74 Next, we optimized hash oligo concentrations to robustly label nuclei from sectioned tissue, developed a protocol for blotting the hash-oligos onto the tissue, and verified that we could disso- ciate, recover, and sequence single cell transcriptomes from labeled nuclei from cultured cell lines or sectioned tissue affixed to a glass slide (Figure 3.9, Figure 3.10). To maximize the recovery of cells sequenced from each slide, we optimized our protocol to thoroughly dissociate cells, and altered the procedure for seeding sci-RNA-seq wells (FACS → dilution). After these optimiza- tions, we arrived at the sci-Space protocol, which consists of four steps: 1) fresh-frozen tissue is sectioned; 2) sectioned tissue is permeabilized and physically juxtaposed to a glass slide bearing the spatially gridded hashing oligos; 3) during oligo transfer, the spatial grid is imaged; and 4) nu- clei from the tissue on the slide are extracted, further hashed with a slide-specific oligo, chemically fixed and subjected to sci-RNA-seq (Figure 3.1A; Figure 3.11, Figure 3.12). 3.3.2 Spatially-resolved single cell sequencing of the mouse embryo To apply sci-Space, we profiled fourteen sagittal sections derived from two E14.0 mouse em- bryos (C57BL/6N). After sequencing, conservative doublet removal (100), quality filtering by RNA UMIs, and assignment of each cell to a slide (based on its slide-specific hash oligo), our dataset comprised 121,909 spatially resolved single cell transcriptomes, with an average of 2,514 unique molecular identifiers (UMIs) per cell, 1,231 genes detected per cell, and without apparent batch effects between slides or embryos (Figure 3.13, Figure 3.14). This corresponds to capture of 164 nuclei/mm2 of tissue on average, or sampling of 2.2% of the estimated nuclei present (Fig- ure 3.15). Rather than annotating cell types ab initio, we co-embedded (101, 102) these data with published, non-spatial sci-RNA-seq “mouse organogenesis cell atlas” (MOCA) data spanning E9.5 to E13.5 (84). Reassuringly, these E14.0 data distributed to the distal tips of various E9.5 to E13.5 trajectories (Figure 3.1B). An initial set of cell type annotations were inferred by nearest neighbor label transfer between the E13.5 time point of the MOCA dataset and the sci-Space dataset. We also transferred annotations from a developing mouse brain atlas (DMBA) spanning E13.5 to E14.5 (103). The cell type annotations inferred by nearest neighbor label transfer were highly concordant 75 Figure 3.1: sci-Space recovers single cell transcriptomes while recording spatial coordinates. (A)Fresh- frozen, sectioned tissues are exposed to a grid of arrayed single-stranded oligos, a subset of which are fluorescent waypoints, and then imaged. The oligo transfer and imaging procedure takes 3 to 5 minutes per slide. The cells from each slide are labeled with an additional, section-identifying barcode prior to pooling and sci-RNA-seq. (B) Joint embedding of E14.0 single cell transcriptomes from this study (yellow) and published data from single cell transcriptomes spanning E9.5 to E13.5 (84). Major trajectories are labeled. (C) UMAP embedding of 121,909 cells from sectioned E14.0 mouse embryos. Cell types are denoted by color and number. 76 with those derived from an orthogonal approach using Garnett (104), a semi-supervised method for annotating single cell data (Figure 3.16). Finally, these annotations were refined by manual inspection of differentially expressed genes (Figure 3.1C). Images of sectioned embryos and sequencing data were co-registered by matching the SYBR waypoints in the oligo grid to identifiable spots of SYBR staining in the image (105, 106) (Fig- ure 3.2A; Figure 3.17, Figure 3.18). This alignment yielded an affine transformation from grid positions to the image, from which cell coordinates were then calculated within the sectioned tis- sue (Figure 3.18). Nuclei were mapped to the position with the highest combination of spot and sector oligos within the imaged section. For about 9% of nuclei, the top assignment was not lo- cated near any other nuclei of the same cell type; for such “outliers”, alternative mappings were considered (Methods). Altogether, 121,909 nuclei were well-localized (Figure 3.13C-F) to one of 15,102 spatial positions across 14 sections; on average, each spatial position was assigned of 8.1 nuclei (10.5 s.d.) that when summed, included 20,296 UMIs (29,465 s.d) and 4,428 genes (3,251 s.d) (Figure 3.19). We next quantified the distribution of cell type annotations across each section. First, each slide’s image was segmented into readily discernible organs (Figure 3.2B; Figure 3.20), a pro- cess aided by immunostaining and alignment of adjacent sections (107) (Figure 3.21). As ex- pected, neurons mapped largely within the spinal cord and brain outlined by cells of the developing meninges, cardiomyocytes within the heart, and white blood cells throughout the organism (Fig- ure 3.2C; Figure 3.22). Although several cell types mapped almost exclusively to one anatomic segment (e.g. hepatocytes to liver), others mapped more broadly (e.g. connective tissue pro- genitors) (Figure 3.2D; Figure 3.22, Figure 3.35). Analysis with Giotto (108), an unsupervised tool for segmenting spatial transcriptomic images into tissue “domains” of similar cell type com- position, revealed 22 domains shared across sci-Space slides. In addition to detecting boundaries between major organs, Giotto was able to automatically recognize more complex domains with dis- tributions extending throughout the embryo (e.g. mesenchymal tissue and cartilage) (Figure 3.36). Finally, wherever cells are captured, sci-Space data enables the visualization of any gene in the transcriptome akin to an in situ hybridization, albeit at lower spatial resolution. For example, a 77 sci-Space “digital in situ” of the dopamine transporter Slc6a3 highlights a cluster of dopaminergic neurons at the midbrain-hindbrain boundary, consistent with stage- and section- matched whole- mount in situs (Figure 3.2E); additional such examples are shown in Figure 3.37. However, unlike conventional in situ data, sci-Space data also resolves gene expression attributable to different cell types across the embryo. For example, in the heart, both cardiomyocytes and endothelial cells in the heart express the growth factor Fgf1, while only cardiomyocytes express the growth factor receptors Fgfr1, Fgfr2 and Fgfr3 (Figure 3.2F). Figure 3.2: sci-Space captures spatially and cell type resolved gene expression across the embryo. (A) Co-registered DAPI stained section image and oligo array, superimposed. SYBR waypoints are highlighted as green spots. (B) Anatomical regions of Slide 1 (left) and an adjacent immunostained serial section aligned to Slide 1 (right). (C) Highlighted cell types mapping to a single slide. (D) UMAP embedding colored by anatomical regions. (E) Gene expression of dopamine transporter Slc6a3 from sci-Space data (left) and published (109) section/stage matched in situ (right). (F) Smoothed percentage of gene expression for Fgf1, Fgfr1, Fgfr2 and Fgfr3 in either cardiomyocytes (top) or endothelial cells (bottom). Color scaled to maximum percentage within each gene. Scale bars in panels (A-C) = 0.5 mm. 78 3.3.3 Contrasting sci-Space and spatial transcriptomics methods relying on patterned arrays. A key distinction between sci-Space and STC (spatial transcript capture; Figure 3.6, left) such as Slide-seq, is that because STC methods capture transcripts from lysed tissue sections (93, 94), each spot can include RNA from multiple cells and/or portions of cells. As such, STC methods are limited in their ability to resolve gene expression variation within individual cells or cell types. To quantify the consequences of this limitation, we aggregated transcriptomes from sci-Space cells derived from the same spatial positions, effectively creating a “mock-STC” dataset. Cell types and clusters readily discernible in the single cell data were hard to resolve or lost entirely in the mock- STC data Figure 3.38A-C). One approach to ameliorate this limitation involves mapping cells profiled by non-spatial single-cell RNA-seq onto the spatial “scaffolds” afforded by STC, thereby imputing cellular locations within a tissue (102, 110). To assess the viability of this approach, we performed such imputation on the mock-STC data, and compared it to the single nucleus data used to derive it Figure 3.38D,E) (102). The mean distance between a nucleus’ imputed position and its measured position was 11.6 spots ( 2.5mm). The discrepancy between imputed and measured positions was greater for cell types found throughout the embryo (e.g. skeletal muscle) than for those present in a single organ (e.g. cardiomyocytes) (Figure 3.38F). Thus, despite sampling fewer transcripts than STC methods Figure 3.38G-I), the retention of single cell resolution by sci-Space (that is, the ability to unambiguously ascertain sets of transcripts expressed in the same cell) affords a key advantage. Analogously to what has been done with STC methods, but with the advantage of single cell resolution throughout, we posited that sci-Space data could serve as a scaffold for the imputation of locations of cells profiled by non-spatial single-cell RNA-seq. To test this, we aligned neurons from the developing mouse brain atlas (DMBA) to sci-Space neurons. Reassuringly, transfer of coarse DMBA anatomic annotations (from dissection) were consistent with measured sci-Space coordinates Figure 3.40). Furthermore, this alignment mapped hundreds of DMBA transcriptional clusters to specific positions, many of which were spatially restricted Figure 3.40). For example, of the 193 UMAP clusters from La Manno and colleagues(103) that mapped to slides 13 and 14, 79 94 and 115 clusters displayed statistically significant focal enrichment, respectively (FDR < 0.01; Getis-Ord Local G). 3.3.4 Spatial patterns of gene expression To systematically examine these data for spatially patterned, cell type-specific gene expression across the E14.0 embryo, we quantified spatial autocorrelation – the degree to which the cells expressing a given gene are spatially proximate. Testing each annotated cell type separately, we identified hundreds to thousands of genes exhibiting positive spatial autocorrelation per cell type (Moran’s test, FDR < 0.001). Amongst the cell types analyzed, connective tissue progenitors and neurons had the most spatially autocorrelated genes detected (Figure 3.3A; mean 12,150 ± 2,270 and 8,623 ± 3,846 genes per slide, respectively). A “trivial” explanation for such spatial autocorrelation of genes within a cell type would be the presence of spatially-restricted, unannotated cell subtypes. For example, upon sub-clustering of connective tissue progenitors, we can indeed find spatially restricted cell subtypes (Figure 3.42A,B), such that the genes defining these subtypes are naturally expected to be spatially autocorrelated. Alternatively, a gene’s spatial pattern of expression could also arise from spatially restricted gene expression contributed by multiple cell subtypes. To distinguish between these scenarios, we cal- culated a gene’s spatial autocorrelation (Spatial Moran’s I) and compared this value to its autocor- relation in UMAP space (UMAP Moran’s I) – a proxy for gene expression driven by a single cell subtype. Across all genes, the two measures were reasonably well-correlated (Pearson’s rho 0.49; p-value < 2e-16). However, a subset of genes, particularly in neurons, displayed a higher spa- tial autocorrelation in the tissue context than their autocorrelation in UMAP space (Figure 3.3B; Figure 3.43A,B). Closer inspection of neurons revealed that the Hox genes, a class of homeotic transcription factors that specify the body plan, were featured prominently in this subset, consistent with spa- tial patterning that cannot be explained solely by spatial restriction of a single cell subtype (Fig- ure 3.3B,C; Figure 3.43A-D). Genes in the HoxA cluster were expressed in a manner reminiscent 80 of the establishment of an anterior-posterior axis of the spinal cord (111) within both excitatory and inhibitory neurons (Figure 3.3D; Figure 3.43A,B,D). The HoxA cluster’s lack of neuronal sub- type restriction was also observed in a non-spatial spinal cord dataset from the developing mouse (30) (Figure 3.43E,F) and validated using RNA fluorescence in situ hybridization (RNA FISH) (Figure 3.44A-C). Additional, non-Hox genes also displayed an excess of spatial autocorrelation (Figure 3.3E,F; Figure 3.44). One such gene, Cyp26b1, is an enzyme that metabolizes the developmental mor- phogen retinoic acid (RA). The differential response to morphogens like RA is crucial for the pat- terning of tissues and leads to differential cell type specification in the developing organism (112). Upon examination, sci-Space data localized the focal expression of Cyp26b1 to the brainstem, consistent with prior work (113, 114), with expression observed in multiple neuronal subclusters (Figure 3.43A,B; Figure 3.44A,B). This pattern of expression was observed across multiple slides (Figure 3.44C) and validated by RNA FISH staining of Cyp26b1 and neuronal-subtype specific genes (Figure 3.3G; Figure 3.44C). Together, these data identify Cyp26b1 expression both in pro- genitors (radial glia) lining the hindbrain and in their progeny, spatially-adjacent mature neurons, suggesting that the expression of Cyp26b1 is retained as these cells differentiate. This analysis il- lustrates how sci-Space can distinguish between spatial patterns of gene expression that are driven by a single cell type from those that are driven by multiple cell types. 3.3.5 Quantifying the explanatory power of spatial position We next asked how each cell type’s transcriptome varied globally across the embryo rather than on a gene-by-gene basis. We calculated the angular distance between the global transcriptomes of pairs of cells of the same type located at varying distances from one another (115). For many cell types, as the physical distance between cells increased, so did the angular distance between their transcriptomes. However, this trend varied by cell type, e.g. it was particularly pronounced in radial glia, neurons and endothelial cells (Figure 3.4A; Figure 3.45A,B). To quantify the contribution of spatial context to variation in gene expression across individual 81 Figure 3.3: Spatially restricted gene expression for developing neurons.(A) Number of spatially signifi- cant (FDR < 0.001) autocorrelated genes within each slide (color) and cell type. Only cell types with more than 50 cells per slide were included. (B) Log-log (log10) plot of autocorrelation in UMAP embedding (x- axis) versus autocorrelation in spatial coordinates (y-axis) for each gene. Computed on excitatory neurons from Slide 1. Moran’s I values close to 1 indicate perfect spatial correlation, while a value of 0 indicates a random spatial distribution. Hox genes are highlighted. (C) log10-scale boxplot of Moran’s I statistic for Hox genes displayed in (B) versus all other expression level-matched genes (p-value < 0.001, two sided t-test). (D) Gene expression of HoxC cluster in Slide 1. (E) Similar to (B), log-log (log10) plot of autocorre- lation in UMAP embedding (x-axis) versus autocorrelation in spatial coordinates (y-axis) for each gene with genes in different regimes highlighted for Slide 1. (F) Expression patterns across Slide 1 for other spatially restricted genes that are not restricted to a single neuron subcluster. (G) Comparison of sci-Space (shown on the more densely populated Slide 14) and RNA FISH (RNAscope) detected Cyp26b1 patterns of expression (gray) and coexpression with markers (colors as indicated in key) for neuronal and supportive cell types. cells, we developed a new statistical approach (Methods). Briefly, we first partitioned cells into groups based on cell type and spatial location. Then, we computed the angular distance between each cell and the average expression profile for cells of that same type in the same spatial bin. Some variance across cells is technical, due to sampling only a fraction of transcripts in each cell. We estimated the technical variance attributable to data sparsity by simulating single cell UMI count 82 profiles from each of the group averages. After subtracting technical variance from total variance, we were able to quantify how much of the remaining biological variance was due to each cell’s type and/or spatial position. We also estimated the variance that one could expect to explain using this approach under a null model that permuted cell type and spatial position labels. Before applying this variance decomposition approach to sci-Space data, we sought to validate it on another dataset for which sources of transcriptional variation across cells are better under- stood. Packer et al. previously analyzed developing C. elegans embryos, which have a defined, deterministic lineage, and observed that a cell’s position in the lineage is highly predictive of its gene expression profile (85). We therefore decomposed variance across cells from this dataset by grouping cells that shared a common parent in the worm lineage. According to the Law of Total Variance, total variance across cells should equal the variance within these groups plus the vari- ance between group averages. Reassuringly, this was the case (Figure 3.46C). In agreement with Packer et al., our new approach showed that variance attributable to a cell’s lineage-relationship in the C. elegans lineage peaked around generation 7 (Figure 3.46D). We next applied this variance decomposition approach to the sci-Space data. Variance at- tributable to sparse UMI sampling accounted for 95.1% of the observed variance in global gene expression across cells. In subsequent analyses, we focused on the remaining 4.9% “non-sampling” variance, and found that spatial position within the mouse embryo contributes to transcriptional heterogeneity within many cell types. Across all cells in the dataset, cell type alone accounted for 19% of the non-sampling variance in global gene expression. Moreover, a joint model that included both cell type and spatial position accounted for 50% of the non-sampling variance (Figure 3.47). The implication – that spatial information explains as much if not more of gene expression variance as major cell type – was supported by the recovery of cell type and spatial gene modules of similar size and composition (Figure 3.48). However, some cell types’ transcriptomes appeared far more sensitive to spatial position than others (Figure 3.4B). For example, chondrocytes were highly in- fluenced by position within the embryo, reflecting the ongoing development of various connective tissue lineages at E14.0, and explained at least in part by subclusters that appear to correspond to digit condensates and craniofacial mesenchyme (Figure 3.4C; Figure 3.49). Other such cell types 83 included neurons and their precursors, the radial glia. Figure 3.4: Quantifying and characterizing the variance in gene expression attributable to spatial position.(A) Pairwise angular distance (radians) between global transcriptomes of cells of the indicated cell types. Cell pairs are grouped by distance on the physical array (mm). ** : p-value < 0.001, *** p-value < 0.0001, Wald linear regression test (Methods). (B) Proportion of variance, apart from that attributable to sparse UMI sampling, explained within cells of each type by spatial position. Point size indicates number of cells and point color indicates the slide of origin. (C) Recovered positions of chondrocytes from Slides 6, 11 and 14 colored by subcluster. Arrows indicate focal concentrations of craniofacial mesenchyme (green) and digit condensate subclusters (red). Insets to the right of each plot show parts of each image with similarly positioned arrows. White text of each inset labels the anatomic structure displayed. Scale bars in panel (C) = 0.25 mm. 84 3.3.6 Pseudotemporal sci-Space trajectories reflect neuronal migration dynamics To explore how spatial context might relate to gene expression heterogeneity in a developing cell lineage, we focused on radial glia and neurons. In particular, we hypothesized that we might be able to detect and localize the coordinated processes of neuronal differentiation and migration (116). UMAP dimensionality reduction of these cell types revealed the presence of three distinct trajectories originating in radial glia and leading to neurons (Figure 3.5A). Gene expression dy- namics along these three branches were consistent with neuronal differentiation – the upregulation of cell cycle genes followed by expression of genes involved in migration (Figure 3.50). Each branch was strongly enriched for subtype specific marker gene expression – Pou4f1+ Pax3+ tec- tal neurons, Isl1+/Lhx6+ cortical interneurons, and Emx1+/Neurod6+ cortical pyramidal neurons – indicating that the embedding captures their specification from radial glia (Figure 3.5B). We then examined how these trajectories were spatially distributed, by segmenting the brain from each tissue section into one of six regions using the Allen Institute’s Anatomical Reference Brain At- las (www.atlas.brain-map.org) as a guide. Cells from each trajectory overwhelmingly occupied a distinct brain region (Figure 3.5C). To quantify progression through differentiation, we calculated pseudotime for each cell using radial glia as the root; these values were correlated with inferred embryonic age from the developing mouse brain atlas (DMBA) (103) (Figure 3.51). Intersecting pseudotime and spatial information, we find that cells early in differentiation clustered around the ventricles in the forebrain and developing midbrain, while those farther away exhibited a more differentiated transcriptome (Figure 3.5E; Figure 3.51). The spatial gradients of cellular maturity estimated with sci-Space data are consistent with well-documented coordination of cellular differentiation and neuronal migration. In the pallium, immature neurons migrate and differentiate radially outwards leading to the inside-out develop- ment of the cortical layers (116). In the sub pallium, cortical interneurons born in the ganglionic eminences migrate tangentially to populate the developing cortex and olfactory bulb (117). Our data also identify a third major pattern of migration in which precursors emanate from the dorsal aspect of the ventricular zone in the developing midbrain. These midbrain neurons seem to migrate 85 both radially, towards the pial surface, and tangentially, parallel to the pial surface, to populate this region (Figure 3.5E - Slides 8, 13, and 14; Figure 3.51 - slides 4 and 7). Although radial and tangential migration are generally discussed as mutually exclusive phenomena, our data – in line with some prior work (118, 119) – suggests otherwise in the developing midbrain. Furthermore, these cells share a common transcriptional program with differentiating and migrating neurons in the pallium and subpallium (Figure 3.5F). 3.4 DISCUSSION In sum, sci-Space is a new method for spatial transcriptomics that retains single cell resolution while capturing spatial information at a scale specified by a patterned array of cell hashing oli- gos. As a proof-of-concept, we applied sci-Space to retrieve the approximate spatial coordinates of transcriptionally profiled cells across whole mount sections from E14.0 mouse embryos. The sci- Space data are readily integrated with non-spatial single-cell RNA-seq data previously collected from mouse embryos at adjacent timepoints, enabling rapid annotation of diverse cell types and visualization of cell type-specific, spatially patterned gene expression, i.e. digital in situs. We identify examples, some expected and others novel, of genes expressed in an anatomically pat- terned manner within cells of a given type. The spatial resolution of sci-Space is presently limited by the patterned array of hashing oligos, here to approximately 200 microns. Although increasing spot density and decreasing spot size are a straightforward path to increasing resolution, sci-Space is unlikely to detect effects arising from interactions between adjacent cells or to resolve gene expression gradients over short (<30 µm) distances. A further limitation is that we currently recover only a fraction of cells from each serial section, such that we are imputing spatially resolved maps of gene expression based on a “sample”, rather than achieving dense measurements over full sections as one would obtain from in situ hybridizations, spatial transcriptomics or Slide-seq. Nonetheless, sci-Space fills a key need not addressed by these or other technologies. Like other spatial transcriptomic methods relying on patterned arrays (e.g. Slide-seq), sci-Space can be 86 Figure 3.5: Pseudotemporal spatial trajectories capture migratory patterns in the developing brain. (A-C) UMAP embedding displaying the neural trajectories colored by (A) cell type, (B) specific gene ex- pression, (C) cortical region, or (D) pseudotime. (E) Spatial positions of neurons and radial glia in the cortex colored by pseudotime if a neuron was found in the trajectory or otherwise colored grey. Insets of rostral (below) and caudal (above) brain outlines are shown to the right of each slide (M-Midbrain, P-Pallium, SP- Sub Pallium, V-Ventricle). (F) Scaled and centered gene expression for genes (rows) significantly varying over pseudotime in all three trajectories. Enriched Gene Ontology Biological Processes terms (GO BP) are displayed next to clustered genes. 87 applied, routinely and efficiently to large regions, e.g. whole embryo serial sections. However, unlike these methods, sci-Space recovers single cell transcriptomes. It can therefore capture pat- terns of spatial gene regulation private to specific cell types and estimate the contribution of each cell type to the expression of morphogens and other signalling molecules, both within and across anatomical regions. Moreover, sci-Space data can serve as a spatial “scaffold” for conventional, non-spatial single-cell RNA-seq atlases, which may be considerably more challenging to map onto tissues profiled by spatial profiling methods that lack single-cell resolution. At the other end of the methodological spectrum, seqFISH, MERFISH, FISSEQ and similar multiplex mRNA imaging methods do provide single cell resolution and are readily capable of detecting gene expression gradients consequent to cell-cell interactions. On the other hand, the complex instrumentation and long imaging times that these technologies require are impractical to routinely apply at the scale of entire tissues, organs or embryos. Furthermore, at least for seqFISH and MERFISH, genes are profiled through targeted hybridization with gene-specific probes that must be designed and validated. In contrast, the imaging steps of sci-Space can be performed on a conventional widefield microscope in a few minutes per slide, and capture the global gene expression program via untargeted 3’ end mRNA sequencing. Finally, we developed a statistical approach to identify cell types in the developing embryo that exhibit spatially regulated gene expression. This method quantified the contributions of spatial position, cell type, and technical factors to decompose the variance across cells’ transcriptomes. We found that in some cases, spatial context explained as much or more variance as cell type. Closer analysis of radial glia and neurons revealed gradients of developmental maturity in different regions of the brain indicative of known and novel patterns of neuronal migration. Together with data from complementary technologies, we anticipate that the further application of sci-Space to serial sections spanning entire embryos from many timepoints will facilitate the construction of a set of highly time- and space-resolved 4-dimensional atlases of gene expression across the entirety of mammalian development. 88 3.5 ACKNOWLEDGMENTS We thank members of the Lampe, Stevens, Shendure, Trapnell lab and others for their critical feedback, particularly Eva Nichols, Riza Daza, Junyue Cao, Vijay Ramani, and Lauren Saunders for helpful discussions during the development of the protocol and analysis of the results. We also thank Choli Lee and Dana Jackson for supporting the research environment. Funding: As- pects of this work were supported by funding from the NIH (1R01HG010632 to C.T. and J.S.; DP2HL137188 to K.R.S.), an NIH/NIBIB training grant (T32EB1650 to S.S.), the Brotman Baty Institute for Precision Medicine, the Paul G. Allen Frontiers Foundation (Allen Discovery Center grant to J.S. and C.T.), the Chan Zuckerberg Initiative (to C.T.), the Washington Research Foun- dation Postdoctoral Fellowship (to M.C.R.). J.S. is an investigator of the Howard Hughes Medical Institute. Author contributions: S.R.S. and M.C.R. developed sci-Space and performed the experi- ments. J.L. aided in the fabrication of sci-Space slides with the support of P.L.; S.R.S. and M.C.R. performed the computational analyses with assistance from E.B., P.G., J.F., M.D. and J.P. and under the supervision of C.T.; Immunostaining and RNA FISH experiments were performed by M.C.R. with assistance from S.S.; M.S. and C.L. provided critical feedback on the project; S.R.S., M.C.R., C.T., K.R.S. and J.S. wrote the manuscript with input from all coauthors; and K.R.S. J.S. and C.T. supervised the project. Competing interests: One or more embodiments of one or more patents and patent applications filed by the University of Washington may encompass the methods, reagents, and data disclosed in this manuscript. Data and materials availability: Processed and raw data can be downloaded from NCBI at GSE166692. All the code used to perform the presented analyses is hosted on github and indexed on Zenodo. Developed laboratory protocols can be found on protocols.io (see Methods). 89 3.6 METHODS 3.6.1 Overview Sci-Space is a single cell technology wherein positional information is recorded and read out in conjunction with single cell transcriptomes. This process is mediated by the spatial transfer of hashing oligos (which resemble poly-adenylated (polyA) transcripts) onto cells. A combination of 3 hashing oligos is meant to uniquely specify the slide from which a cell originates as well as its position within a regular grid on that slide. After performing sci-RNA-seq library preparation and sequencing, the spatial barcodes found within each cell’s library are used to assign it to a slide as well as to position it within the grid. The molecular biology details needed to perform the sci-Space procedure and library preparation are outlined below as well as in a series of published protocols.io sites: • Creating sci-Space grids for spatial barcoding - https://www.protocols.io/view/creating-sci- space-grids-for-spatial-barcoding-bm64k9gw • Spatial transfer of oligonucleotides and imaging - https://www.protocols.io/view/spatial-transfer- of-oligonucleotides-and-imaging-bqt7mwrn • Single cell RNA sequencing library preparation - https://www.protocols.io/view/single-cell- rna-sequencing-library-preparation-2-l-bquamwse 3.6.2 Creating sci-Space grids for spatial barcoding A thin membrane of dried agarose was fabricated on the surface of microscope slides (Superfrost Plus, Thermofisher). This agarose matrix absorbed and retained an array of spotted oligo hashes. To prepare nuclease-free agarose, 3% w/v low melting temperature agarose powder (SeaPlaque, Lonza, Bend, OR) was added to deionized water containing 0.1% v/v diethyl pyrocarbonate, incu- bated 2 hr at room temperature, and autoclaved for 15 min. The uniform thickness of the layer of agarose across the slide surface was patterned using spacers of two stacked 22 x 22 mm, number 90 one thickness (0.15 ± 0.02 mm each) coverslips overhanging either end of the slide. Molding of the agarose was performed by pipetting a 300 µL volume of heated agarose solution into the center of the slide and slowly placing a second slide onto the agarose solution avoiding the formation of bubbles. The molding slide was allowed to rest on the cover glass spacers. After the agarose had gelled between the two slides ( 30-60 min on ice) a razor blade was used to release the exposed edges of the agarose layer from the top, molding slide. The two slides were then carefully slid apart and the cover glass spacers were removed. The resulting thin layer of agarose gel was dried onto the bottom slide overnight in a biosafety cabinet. All agarose slides were UV-treated for 20-30 min prior to spotting to further protect against nuclease activity. The space-grid array of hashing oligos and SYBR green reference points was spotted onto agarose-coated slides using a QArray2 microarray scanner (Genetix, New Milton, Hampshire, GB). A series of 384-well high sample recovery plates (Molecular Devices, San Jose, CA, X7020) was prepared containing a final concentration of 15µM spot oligo and 2.5µM sector oligo per well (Integrated DNA Technologies, Coralville, IA), and 0.5% v/v glycerol, with or without SYBR green dye ([5x] Thermofisher, S7585) to achieve the predetermined oligo and SYBR green refer- ence point layout when a 21 x 21 spot/pin array was printed with 16 spotting pins (4 x 4 grid). These printing parameters gave space-grids containing 7056 (84 x 84) spots of unique oligo com- binations. The spotting height was adjusted to ensure consistent contact of the spotting pins with the transfer slides’ agarose coating. 3.6.3 Testing blotting concentrations of hash oligos and SYBR green Space-grids for testing hash oligo blotting concentrations were prepared as noted above using the QArray2 microarray scanner (Genetix, New Milton, Hampshire, GB). Each space-grid was given a single distinct DNA barcode sequence at a chosen final concentration (10µM, 20µM, 25µM or 50µM) with a single sector marked with 5x SYBR green. These space-grids were then blotted onto a series of mouse embryo sections ranging from E13 to E16 (Zyagen, San Diego, CA), by sandwiching the two slides (tissue and space-grid). First, permeabilization and hashing solution 91 was prepared for each slide by mixing a unique slide-specific hash oligo (5µL at 10µM) in the 495µL permeabilization solution [10mM Tris/HCl pH 7.4, 10mM NaCl, 3mM MgCl2 with 1% v/v superase inhibitor (Invitrogen) and 0.1%v/v IGEPAL CA-630 (Sigma Aldrich)]. Following permeabilization, each slide was barcoded via transfer with a test space-grid. The transfer was then imaged and the cells were harvested by cell scraping into a solution of 5% paraformalde- hyde (cat no. 100504-940, VWR) in 1x PBS. This is described in detail in a subsequent section (“Spatial Transfer of Oligonucleotides and Imaging”). After 15 minutes of fixation on ice, cells were centrifuged (800g for 10 minutes), pooled and subjected to sci-RNA-seq2 library preparation (120). This is described in detail in a subsequent section (“Single cell RNA sequencing library preparation”). To test co-cultured human and mouse cells, HEK293T cells and NIH3T3 cells were placed in a droplet on a coverslip coated with 1% gelatin. 4 coverslips were prepared and cells were allowed to attach overnight to the coverslip surface. Next, each coverslip was permeabilized as detailed above and labeled with a gel containing a single hash oligo. These nuclei were then scraped into paraformaldehyde, fixed for 15 minutes on ice and subject to sci-RNA-seq library preparation as detailed below. 3.6.4 Testing spotted space-grids Hash oligos from three space-grids were dissolved in 500µL of permeabilization solution [10mM Tris/HCl pH 7.4, 10mM NaCl, 3mM MgCl2 with 1% v/v superase inhibitor (Invitrogen) and 0.1%v/v IGEPAL CA-630 (Sigma Aldrich)]. Concurrently, three aliquots of 2 million HEK29T cells were harvested and washed once with 1x PBS. The resuspended hash oligo solutions were then used to lyse and label the HEK293T nuclei. After a 3 minute incubation on ice, the nuclei suspension was chemically fixed with 5mL of 4% paraformaldehyde and incubated for 15 minutes on ice. Nuclei were then pelleted at 500g for 5 minutes, washed, with 500µL Nuclei Buffer (NSB) [10mM Tris/HCl pH 7.4, 10mM NaCl, 3mM MgCl2 with 1% v/v superase inhibitor (Invitrogen) 1% v/v BSA (New England Biolabs)] and permeabilized by resuspension in 500µL NB + 0.2% 92 Triton-X. These nuclei were centrifuged and washed with 500uL of NSB. These nuclei were then pelleted and 5000 nuclei from each sample was loaded into indexed reverse-transcription reactions. Reverse transcription was performed as described previously (120) were pooled and 25 nuclei were sorted into a 96 well plate containing 16µL of elution buffer per well. Libraries were prepared by performing an indexed PCR using 20uL of NEBNext High-Fidelity 2X PCR Master Mix (NEB), 2µL of 10µM indexed P5 primer and 2µL of 10µM indexed P7 primer. PCR was run for 18 cycles with the following settings: 72°C for 5 min, 98°C for 30 sec, 18 cycles of (98°C for 10 sec, 66°C for 30 sec, 72°C for 30 sec) and a final 72°C for 5 min. These libraries were then pooled and sequenced on a Nextseq 500 (Illumina, San Diego, CA) using a high output 75 cycle kit (Read 1: 18 cycles, Read 2: 52 cycles, Index 1: 10 cycles and Index 2: 10 cycles). 3.6.5 Spatial transfer of oligonucleotides and imaging Serial sections of an E14.0 mouse embryo were purchased (Zyagen, San Diego, CA) and stored at -80°C prior to use. Oligo hashes were transferred in their arrayed pattern from the space-grid slides to fresh-frozen embryo sections by diffusion through cell permeabilization buffer. First, the embryo slide was placed so that it rested (tissue facing up) with the tissue section between two transfer clips. Subsequently, 500µL of nuclei permeabilization buffer [10mM Tris/HCl pH 7.4, 10mM NaCl, 3mM MgCl2 with 1% v/v superase inhibitor (Invitrogen) and 0.1%v/v IGEPAL CA- 630 (Sigma Aldrich)] with 5µL of slide-specific hashing oligo at 10µM and 5µL of 500µM stock DAPI, was pipetted gently onto the tissue section and across the long edge of the embryo slide nearest the user using an wide bore p1000 tip. A space-grid transfer slide was then positioned (agarose surface facing the tissue section) so that the arrayed oligos were aligned between the two transfer clips and spanned the tissue section’s extent. Placement of the space-grid slide was achieved by tilting the slide so that its long edge nearest the user contacted the edge of the tissue section slide and fit under the overhanging fastening teeth of the transfer clips. The space-grid slide was then rocked toward the embryo section slide until the two slides were face-to-face with the tissue section contacting the space-grid oligo array-laden agarose membrane. Excess buffer 93 was allowed to wick into a laboratory wipe. When stacked, the slides snapped into the transfer clips and were thereby securely held together during transfer. The slide stack was moved to the microscope stage and the entire embryo section was imaged in GFP and DAPI channels. The transfer slide was then removed from the transfer clips and separated from the tissue. Cells of the embryo section were then scraped using a cell scraper (Fisherbrand, GDPC240) from the slide into a 4% paraformaldehyde fixing solution. At this point, Slides 7-14 were subjected to sonication using the bioruptor sonicator (Diagenode, product number B01020001). This extra round of sonication aided in disassociating more nuclei. After fixation for 15 minutes on ice, cells were spun down in 1.5mL tubes in a chilled benchtop centrifuge at 800g for 10 minutes. The supernatant in each tube was removed and cells were pooled in 1mL of NSB [Nuclei Buffer (10mM Tris/HCl pH 7.4, 10mM NaCl, 3mM MgCl2) with 1% v/v superase inhibitor (Invitrogen) and 1% v/v BSA (New England Biolabs)], flash frozen and stored at -80C. 3.6.6 Single cell RNA sequencing library preparation Frozen nuclei were thawed over ice and spun down at 800g for 8 minutes in a 15mL conical tube. Cells were then permeabilized in 400µL of permeabilization buffer (NSB + 0.25% Triton-X) for 3 minutes and then spun down at 800g for 8 minutes. Following resuspension in 500µL NSB, two-level sci-RNA-seq libraries were prepared as previously described (98, 120). Briefly, nuclei were first sonicated for 12 seconds using the bioruptor sonicator on the low setting. This caused disruption of many nucleus aggregates that had formed. Cell counts were then obtained by staining nuclei with 0.4 % trypan blue (Sigma-Aldrich) and counted using a hemocytometer. 5000 nuclei in 2µL of NSB and 0.25µL of 10mM dNTP mix (Thermo Fisher Scientific, R0193) were then distributed onto a skirted twin.tec 96 well LoBind plate (Fisher Scientific, 0030129512). Next, 1µL of uniquely indexed oligo-dT at 25µM (120) was added to every well and mixed. This 96-well plate was then incubated at 55C◦ for 5 minutes and then immediately placed on ice. Next, 1.75µL of reverse transcription mix (1µL of Superscript IV first-strand buffer, 0.25µL of 100 mM DTT, 0.25µL of Superscript IV and 0.25µL of RNAseOUT recombinant ribonuclease inhibitor) 94 was then added to every well and mixed. Plates containing the reverse transcription reactions were then incubated using a ramping reverse transcription protocol on a thermocycler: 1. 10C◦ for 2 minutes 2. 20C◦ for 2 minutes 3. 30C◦ for 2 minutes 4. 40C◦ for 2 minutes 5. 50C◦ for 2 minutes 6. 55C◦ for 15 minutes 7. 4C◦ forever Wells were pooled and nuclei were transferred to a flow cytometry tube through a 0.35 µm filter cap and DAPI added to a final concentration of 3 µM. At this point pooled nuclei were seeded using one of 2 methods; (1) sorted on a BD FACS Aria II cell sorter (Slides 1-6) or (2) diluted (Slides 7-14). Sorted cells from Slides 1-6 were prepared using 8 RT plates. After calculating the collision rate using the birthday problem calculator (121), we sorted 200 nuclei per well into 96-well LoBind plates with each well containing 5µL of EB buffer (Qiagen) and 0.75µL of second strand mix (0.5 µL of mRNA second strand synthesis buffer and 0.25µL of mRNA second strand synthesis enzyme, New England Biolabs). Diluted nuclei originating from slides 7-14 were prepared using 10 RT plates were first diluted to 50 nuclei per µL in NSB. Diluted nuclei were then premixed with second strand-synthesis reagents (0.5 µL of mRNA second strand synthesis buffer per 5µL of nuclei suspension and 0.25µL of mRNA second strand synthesis enzyme per 5µL of nuclei suspension). 5.75µL of 95 this nucleus/second-strand synthesis solution was dispensed into 96-well LoBind plates to seed 250 nuclei per well. Second strand synthesis performed at 16C◦ for 150 minutes. Tagmentation was then performed by addition of 5.75µL of tagmentation mix per well (0.01µL of a custom n7-loaded Tn5 enzyme in 5.74µL 2x Nextera TD buffer, Illumina) and plates incubated for 5 minutes at 55C◦. This reaction was terminated by addition of 12µL of DNA binding buffer (Zymo, D4004-1-L) and incubated for 5 minutes at room temperature. 36µL of Ampure XP beads (Beckman Coulter, A63880) were added to every well, DNA purified using the standard Ampure XP protocol eluting with 17µL of EB buffer. 16µL of this eluate was then transferred to a new 96 well LoBind plate for index PCR. For PCR, 2µL of 10µM indexed P5, 2µL of 10µM indexed P7 (120) and 20µL of NEBNext High-Fidelity master mix (New England Biolabs, M0541L) were added to 16 uL of eluted template DNA. PCR indices were arrayed such that each well contained a unique combination of P5 and P7 barcodes. PCR was then performed using the following program: 1. 75C◦ for 3 minutes 2. 98C◦ for 30 seconds 3. 98C◦ for 10 seconds, 4. 66C◦ for 30 seconds 5. 72C◦ for 1 minute 6. Return to Step 3 for 17 times 7. 72C◦ for 5 minutes. 8. 4C◦ forever 96 After PCR, all wells were pooled, concentrated using a DNA clean and concentrator kit (Zymo Research, D4033) and purified via a 0.8X Ampure XP cleanup. Final library concentrations were determined by Qubit (Invitrogen), libraries visualized using a TapeStation D1000 DNA Screen tape (Agilent, 5067-5582) and libraries sequenced on a Nextseq 500 using a high output 75 cycle kit (Illumina, 20024906). Libraries were denatured and diluted to 2pM. Sequencing was performed using onboard primers using the following sequencing recipe: • Read 1: 18 cycles (8 base-pair UMI and 10 base-pair RT barcode • Index 1: 10 cycles (10 base-pair PCR index1) • Index 2: 10 cycles (10 base-pair PCR index2) • Read 2: 52 cycles (52 bases of transcript or hash-oligo) 3.6.7 Pre-processing of sequencing data Sequencing data was processed as described previously (98). Briefly, sequencing runs were first demultiplexed using bcl2fastq v.2.19. Only barcodes that matched reverse transcription indices within an edit distance of 2 bp were retained. Following assignment of indices, polyA tails were trimmed using trim-galore (https://github.com/FelixKrueger/TrimGalore), and reads were mapped to a mouse transcriptome (mm-10), human-mouse transcriptome (hg-38 and mm-10) using the STAR aligner. Following alignment, reads were filtered for alignment quality, and duplicates were removed. Reads were considered duplicates if they (1) mapped to the same gene, (2) mapped to the same cell barcode and (3) contained the same unique molecular identifier (UMI). Reads that met the first two criteria, and differed by an edit distance of 1 from a previously observed UMI were also marked as duplicates and discarded. Non-duplicate reads were assigned to genes using bedtools to intersect with an annotated gene model. All 3’ UTRs in the gene model were extended by 100 bp to account for the possibility that some gene 3’ UTR annotations may be too short, causing genic reads to improperly be annotated as intergenic. Cell barcodes were considered 97 to correspond to a bona fide cell if the number of unique reads associated with the barcode was greater than an interactively defined threshold on a knee plot. Reads from cells that passed this UMI count threshold were first aggregated into a sparse matrix format and then loaded and saved as a CDS object for analysis with Monocle3 or Seurat. 3.6.8 Slide registration To map cell locations to an image of the embryo, fluorescent SYBR green spots with known po- sitions are used to orient cells. Images of the hash array with fluorescent SYBR green spots on top of the DAPI-stained embryo section were taken with a 2.5x magnification (Zeiss Observer Z1 Microscope). Prior to the transfer of the image of Slide 12, the Slide 12’s image file was accidentally erased and could not be recovered. The sequenced cells mapping to Slide 12 are included in the dataset with relative spatial positions based on the recovered hash oligos. However, for this slide we were unable to perform any analyses that relied on segmentation or slide registration. The captured images were then used to orient the hash array to the embryo section. More specifically, co-registration of the imaged embryo sections and the oligo hash tagged transcrip- tomes was achieved through alignment of the SYBR green waypoints imaged during transfer to their position within an ideal space-grid layout. Coordinates for SYBR green spots imaged during oligo hash transfer to the embryo section and the corresponding coordinates in an image of an ideal space-grid were obtained in Fiji image processing software (106) using the Big Warp function of the BigDataViewer plugin (107). An affine matrix was computed using the coordinates as source (embryo image) and target (space-grid image) control points in the AffineTransformation function in the “vec2dtransf” and “imager” packages in R. The matrix was applied to the embryo section image. Sequenced nuclei were then mapped to the aligned space of the transformed image and space-grid using their space-grid hashes. The following formula was used to calculate the number of microns per pixel and thereby estimate the size of each spot on the hash array. 98 microns per 1 pixel = (native camera pixel size / objective / camera adaptor) One pixel was equal to 1.816 microns based on the camera pixel size of 4.54 for a Zeiss Axio- cam 503 Mono Camera, an objective size of 2.5x and a camera adaptor size of 1. 3.6.9 Assigning spatial labels from hash reads Reads from hash oligos were demultiplexed as described previously (98). Briefly, demultiplexed reads that matched combinatorial indexing barcodes were examined to identify hash reads. Reads were considered hash reads when they met two criteria: 1) the first 10 bp of read 2 matched a hash barcode in the experiment within an edit distance of two; and 2) contained a polyA track between base pairs 12 to 16 of read 2. These reads were then deduplicated by cell barcode and collapsed by UMIs to create a vector of hash oligo UMI counts for each nucleus in the experiment. To assign each nucleus to the slide from which it came, we tested whether its sci-RNA-seq library was enriched for a particular hash barcode. We compared a nucleus’s hash UMIs against a ‘background distribution’, which under ideal circumstances, would be random and uniformly distributed. To estimate the background distribution, we simply aggregated the hash UMIs from cell indices for which fewer than 10 mRNA UMIs were collected, reasoning that these reflect library contributions from ambient reverse transcriptase products, debris fragments, etc. We then compared the hash UMIs for nucleus to this background by a chi-squared test. After correcting the resulting p-values were corrected for multiple testing by the Benjamini-Hochberg procedure, we rejected the null hypothesis that originates from the background distribution at the specified FDR (5% FDR was used in this study). Those nuclei with hash counts deemed greater than background were also evaluated for enrichment for a single hash sequence. Enrichment ratios were calculated as the UMI count ratio of the most abundant vs. the second most abundant hash oligo. Specifically, if the UMI count for the most abundant hash in a nucleus is α-fold higher than the second most abundant, is marked as a singleton. α was set to 5, which corresponded to a nadir between two modal outcomes – separated unlabeled cells and singularly labeled cells. Cells that fell below 5-fold enrichment of a unique hash oligo were flagged as a multiplet or debris and discarded. 99 A cell’s spatial position within the grid consists of a specific combination of two oligonu- cleotides, a spot oligo and a sector oligo. To find a cell’s position within the grid, we first took a single cell’s vector of spot counts and mapped these counts to their position in the sci-Space grid. Then we performed a 3x3 gaussian convolution on every position, allowing us to account for and integrate spatially local signal from neighboring positions and simultaneously reduce spurious background. Next, we performed an element-wise multiplication of convolved spot values by the matrix of sector counts measured for that cell. The product of spot and sector oligos were then ranked and a cell was mapped to the top ranking combination which matched two criteria: (1) the combination represented a valid pairing and (2) the combination mapped within the boundary of the imaged embryo. This boundary was determined by manually segmenting the outline of the DAPI stained image of the embryo. Upon manual inspection of these draft spatial positions, we noticed a minority of nuclei which were clearly mismapped (e.g. cardiomyocytes outside of the heart). This seemed to occur through a nucleus’ absorption of a neighboring sector oligo or alternatively via barcode collision between cellular debris carrying a spatial index and the sequenced nucleus. To correct these misassignments, each slide was manually segmented into regions of interest (ROI) where a cell type was focally concentrated. Next we asked whether a nucleus of a given cell type had a plausible alternate mapping within these ROIs. A nucleus was moved from its draft position if the nucleus’ highest convolved spot value within the region of interest was within 5-fold of its highest spot value. The value of 5-fold was chosen by examining the distribution of all alternate mappings. This procedure was then repeated for each cluster using an automated algorithm to detect high density regions of clusters within a slide. Through this process, the spatial calls of 9.1% nuclei were remapped. Finally, we removed cardiomyocytes and hepatocytes that mapped outside of the heart and the liver, respectively, resulting in the removal of 369 nuclei. 100 3.6.10 Estimating nuclei counts from embryo images Using python, the 2.5x magnification embryo DAPI stained images were preprocessed using a white top-hat transform followed by a histogram equalization to reduce uneven lighting and in- crease contrast respectively. The images were then thresholded using Otsu’s method. In order to overcome the challenge of counting individual cells in nuclei clusters, the resulting binary masks were separated into ‘dense’ and ‘sparse’ nuclei masks using a connected components algorithm. The dense nuclei masks were used to isolate nuclei clusters in the original embryo images, which were then thresholded to a secondary value defined as Otsu’s value plus a constant intensity shift. The sparse and the dense nuclei masks were then distance transformed, using Euclidean geometry, to generate distance maps. A peak finding algorithm was used to isolate the centroids of peaks in the distance maps and resulting unique centroids were counted as nuclei. 3.6.11 Cell type classification Nearest neighbor classification was performed by aligning the cells from this study to cells from E13.5 time point from the MOCA single cell dataset (84). This MOCA dataset was chosen because it was prepared using nuclear sci-RNA-seq. It is our experience that alignment between datasets produced using the same technology are less sensitive to hyperparameter selection during align- ment. The E13.5 time point was chosen because this time point most closely matched the E14.0 timepoint sequenced in this study. Count matrices from the two datasets were subsetted for genes found in both datasets and then combined. The E13.5 time point was then downsampled and the two datasets were aligned using Seurat v3(102) dataset integration using reciprocal PCA. For each cell in the E14.0 time point the 10 nearest neighbors in UMAP space from the MOCA dataset were recorded. Each cell was then assigned the majority nearest neighbor label. Finally, to remove poor confidence cell type labels, the E14.0 data was clustered and cell type labels that did not account for more than 5% of a cluster were assigned “Unknown”. Garnett classification (104) was performed using a marker-free classifier trained on the E13.5 time point from the MOCA dataset. This classifier was then applied 101 to the cells sequenced in this study. This same process was repeated to match neurons from the developing mouse brain atlas dataset (DMBA) to neurons in the sci-Space dataset(103). Briefly, neurons and radial glia from the E13.5, E14.0 and E14.5 timepoints were used to perform nearest neighbor alignment using reciprocal PCA (102). The majority label from a cell’s 5 nearest neighbors was used to transfer a number of different labels including cell type, anatomical dissection, age, and UMAP cluster in the original dataset. Finally, the different inferred labels were collated along with the top differential genes marking each cluster. The final annotation set consists of a combination of cell types transferred from the MOCA and DMBA datasets along with manual annotation matching a cluster’s differen- tial gene expression. 3.6.12 Immunostaining and adjacent image alignment Before immunostaining serial sections adjacent to sequenced sections were fixed in 4% paraformalde- hyde (Electron Microscopy Sciences) in phosphate buffered saline (PBS, ThermoFisher) for three minutes at room temperature. Sections were then washed for five minutes three times in PBS-T (0.1% Tween-20, VWR), permeabilized for 10 minutes at room temperature in 0.1% Triton X-100 (VWR) in PBS, and washed for five minutes three times in PBS. Autofluorescence was quenched using TrueBlack Lipofuscin (Biotium) according to the manufacturer’s protocol. Briefly, sections were treated with TrueBlack Lipofuscin diluted 20X in 70% ethanol with a 30 second to three minute incubation, then washed for five minutes three times with PBS. Sections were next blocked with 2.5% normal donkey serum (NDS, Jackson ImmunoResearch Laboratories) in PBS for one hour at room temperature. Primary antibodies were applied in PBS containing 2.5% NDS as in- dicated in Table S1 with an overnight incubation at 4ºC. The following day sections were washed for five minutes three times in 2.5% NDS in PBS, then incubated for one hour at room temper- ature with Hoechst 33342, Trihydrochloride, Trihydrate (Invitrogen) counterstain and secondary antibodies diluted as indicated in Table S1 in 2.5% NDS in PBS. Sections were next washed again in 2.5% NDS in PBS and then coverslipped with Fluoromount-G Mounting Medium (Southern- 102 Biotech) prior to imaging. Stained sections were imaged using a Ti-E inverted microscope (Nikon) and multi-field images were stitched in the NIS-Elements (Nikon) software. Each channel of the triple stained sections together with the counterstain was aligned to the DAPI image of the sequenced section using the StackReg plugin in Fiji (rigid body followed by affine). Each aligned channel of the stained section images were then separated from the counterstain and overlaid onto the DAPI channel image of the adjacent sequenced section. 3.6.13 Anatomical annotation and segmentation Annotation was performed using The Atlas of Mouse Development (122) in conjunction with mag- netic resonance images of the E14.5 embryo (123). Annotations were then confirmed using im- munostained adjacent sections (when available). Anatomical segmentation was performed man- ually using the DAPI-stained embryo section and the Big Warp function of the BigDataViewer plugin in Fiji. The region of interest was demarcated by choosing a bounding set of points in clockwise or counter-clockwise order. These points were then used to construct a polygon using the spatial features package in R. These polygons were then scaled using the same affine trans- formation used for slide registration to put them on the same coordinate axis. For polygons with holes, the contour of the entire image was first segmented followed by segmentation of each cav- ity. This same process was repeated to segment and annotate the brain regions. However, the Allen Institute’s Anatomical Reference Brain Atlas (www.atlas.brain-map.org) was used as a guide to annotate the Pallium, Sub Pallium, Midbrain, Hindbrain, Thalamus and Hypothalamus of each brain. 3.6.14 Tissue domains Tissue domains with similar cell type composition were identified using Giotto (108), an unsuper- vised tool for single-cell spatial expression analysis. The top 500 spatially autocorrelated genes (Moran’s I > 0.05, FDR < 0.001, expressed in at least 1% of cells) were identified from all slides. 103 For each slide, spatial domains were identified using the Hidden Markov random field model with parameters k=50, beta=10. To identify consensus spatial domains that existed on multiple slides, UMAP dimensionality reduction was performed based on the absolute numbers of annotated cell- types in each domain from each slide. Community detection with louvain clustering (k=5) identi- fied 22 clusters representing the 22 tissue domains. 3.6.15 Analysis of aggregated spatial positions Nuclei mapping to the same spatial grid position within each slide were aggregated by summing gene expression counts from each nucleus. These spatial positions were then treated as the columns of a gene by position count matrix. This matrix was converted into a Monocle3 CDS object, and used as input for PCA, UMAP and louvain clustering. To assess the ability of integration to recover a cell’s spatial position, we used the Seurat pack- age (version 3.2.2) and applied this analysis to Slide 14. First, nuclei mapping to a spatial position were aggregated and used to create a Seurat object bearing the spatial coordinates of each position. These spatial positions were then processed as described by the Seurat spatial vignette. Briefly, this involved running SCTransform, PCA, UMAP and clustering using the default parameters. The nuclei used to aggregate each of these positions (nuclei from Slide 14) were also processed in the same way (SCTransform, PCA, UMAP and clustering). To integrate the two datasets we used the FindTransferAnchors() function in Seurat to find anchors between the two datasets using the SCTransform as the normalization method. Finally, these anchors were used as an input to the TransferData() function to return a set of predicted spatial positions for each cell. To determine the error associated between data integration and the pre-aggregation ground truth, a cell was assigned to the position with the highest transfer probability returned by TransferData() and the euclidean distance was calculated between the most probable transferred position and the ground truth posi- tion. 104 3.6.16 Kriging gene expression Spatial grid positions were first collapsed such that non-overlapping sets of 4 adjacent positions were collapsed into a single spatial position. Each cell type within these spatial bins was then aggregated by summing the counts for each gene contributed by that cell type. These values were then kriged with the automap package in R using ordinary kriging via the autoKrige() function. Interpolated values were then rescaled to reflect the percentage of gene expression contributed by a cell type at each given position. Finally, a polygon object specific to each slide was used to clip the interpolated gene expression values. 3.6.17 Spatial autocorrelation analysis Gene spatial autocorrelation was computed by first subsetting cell types for which there were more than 100 cells present on a slide. After setting a random seed, we estimated size factors, performed PCA and UMAP dimensionality reduction with a fixed set of parameters. UMAP was run using uwot’s implementation in R with the flag fast sgd set to false. This ensured that UMAP dimension- ality reductions were consistent between runs. Following UMAP, for each subset of cells, a gene’s spatial autocorrelation was computed using either its cell’s spatial coordinates or UMAP coordi- nates as the input into Monocle3’s graph test() function. The resulting test statistic was corrected for multiple testing and genes with an FDR < 0.01 and a Moran’s I test statistic greater than 0.05 were reported as having statistically significant spatial autocorrelation. 3.6.18 RNA Fluorescence in situ Hybridization (FISH) and analysis To validate spatial expression patterns across cell types identified in our dataset, an 8-probe RNAscope HiPlex kit (Advanced Cell Diagnostics, Inc.) including probes against 6 transcripts to mark var- ious cell populations: Gad2 – GABAergic neurons; Slc17a7 – VGlut1+ glutaminergic neurons; Slc17a6 – VGlut2+ glutaminergic neurons; Pax6 – radial glia; Lum – fibroblasts; and finally, Cldn5 – endothelial cells. The kit additionally assayed for Cyp26b1 and Hoxa10 to demonstrate their spatial localization within multiple cell types and neuronal cell subtypes as indicated by our 105 sciSpace data. Briefly, serial sections near Slide 1 of the sci-Space dataset were assayed according to the manufacturer’s protocol. The fresh-frozen tissue sections were fixed using 4% paraformalde- hyde (Electron Microscopy Sciences) in 1X PBS, dehydrated, and treated with the Protease IV kit component. Following the manufacturer’s specified hybridization steps, counterstaining, and cov- erslipping, the first four probes: 1) Cyp26b1, 2) Gad2, 3) Slc17a6, 4) Hoxa10 were imaged. Scans were obtained using an Aperio VERSA slide scanner (Leica Biosystems) with 40x magnification and DAPI, FITC, Cy3, Cy5, and Cy7 filter sets. Coverslips were removed, the first four fluo- rophores were cleaved, the fluorophores for probes 5-8 were hybridized: 5) Pax6, 6) Slc17a7, 7) Lum, 8) Cldn5. Slides were re-coverslipped and imaged as before. Slide scans were analyzed using QuPath 0.2.3 quantitative pathology and bioimage analysis freeware. Briefly, the two scans for each slide were imported and affine transform matrices for alignment were obtained using: Analyze/Interactive image alignment/autoalign/Estimate trans- form; with settings: Registration type -> Affine transform, Alignment type -> Image intensity, Pixel size -> 20. Positive signal foci were identified using Analyze/Cell detection/Cell detection with the setting parameters designated in Table S2. Parameters were manually adjusted to allow for detection of multiple foci in clusters with a bias toward avoiding false positive detection events. Coordinates of detected foci in each channel were output as .tsv files using the Counting function (Counting/Convert detections to points, followed by Counting/Save points). Because the assay lacked a counterstain that could be used to delineate cell boundaries, we relied on transcript proximity to designate positions positive for expression and coexpression. Co- ordinates were analyzed in R to identify transcript locations within putative expressing cells and to estimate the positions of coexpression between transcripts. Briefly, foci located within an approx- imate sub-cellular length (10 um) of two other foci for the same probe were considered transcripts in an expressing cell, i.e. were designated as positive points of expression. To detect coexpression a similar analysis was performed where a positive marker gene (Gad2, Slc17a6, Pax6, Slc17a7, Lum, or Cldn5) position was considered to be coexpressed with Cyp26b1 when it lied within 10 um of two positive positions for Cyp26b1. For Hoxa10, which was more lowly expressed than the other assayed genes, a positive designation for expression was given when at least one other 106 Hoxa10 position fell within 10 um of a focus, and a coexpression designation required one positive Hoxa10 position within 10um of a given positive marker gene position. Expression and coexpres- sion mapping was compared between the RNAscope and sci-Space datasets. 3.6.19 Variance decomposition model The variance of gene expression within and between cell populations was computed using the angular distance metric. Specifically, let y = the vector of log-scaled gene expression levels for each gene in a cell, normalized to be of unit magnitude, let E(Y ) = the arithmetic mean of log-scaled gene expression levels normalized to be of unit magnitude, and let y = the empirical distribution of Y across all cells in a given analysis. Then: θ(a,b) := 2 pi cos−1 ( A ·B ||A||||B|| ) Var(Y ) := E[θ(Y −E[Y ])2] This variance statistic behaves similarly to the variance of a univariate distribution. Most impor- tantly, it obeys the Law of Total Variance (Figure 3.46C). If cells sampled from Y are partitioned into groups X, then: Var(Y ) =E[Var(Y |X)]+Var(E[Y |X ]) = weighted average of within-group variance + variance of averages among groups The “variance explained” by a grouping X, e.g. grouping cells by cell type, can therefore be computed as: 107 Variance explained =1− within-group variance total variance =1− E[Var(Y |X)] Var(Y ) This naive formula for variance explained is strongly affected by the number of groups in the grouping X, similar to how the variance explained in a regression model is affected by the number of degrees of freedom in the model. If one divides a population of cells into random groups of two, variance explained will be high, even though the grouping has no biological meaning. We can correct for this by using an adjusted formula: Variance explained = 1− E[Var(Y |X)] E[Var(Y | permuted X)] where in the denominator, the group id associated with each cell is randomly permuted. This adjusted formula corrects for the number of groups in X, but is confounded by another factor. The observed variance between cell gene expression vectors is a result of both biological heterogeneity and technical factors such as the sparsity of the single cell RNA-seq data. If the same biological sample is profiled in two different experiments, and the median number of UMIs per cell is N in experiment 1 and 4N in experiment 2, then the variance explained by the grouping X = cell type will be lower in experiment 1 vs. 2 due to increased sparsity. To correct for sparsity and estimate the proportion of biological variance explained by grouping X, we use a final adjusted formula: Variance explained = 1− E[Var(Y |X)]−E[Var( resampled Y |X)] E[Var(Y | permuted X)]−E[Var(resampled Y | permuted X)] In this formula, the “resampled Y” terms are computed by replacing each cell’s gene expression 108 vector y with a sample from the distribution Multinomial .The distribution of the multinomial is parameterized by the two parameters, n the number of draws and p a vector of probabilities (p1,p2, . . .pi) or the probability of drawing each gene. For each simulated cell from a given group, n is set to match the number of UMIs measured from a given cell and p is calculated by dividing the number of UMIs for a given gene in that cell grouping, by the total number of UMIs from that group. Resampling in this manner is a way of estimating what variance one would observe if there were no biological heterogeneity in a group of cells and the only source of observed heterogeneity was sparsity. Figure 3.47 applies this formula to estimate the proportion of biological gene expression vari- ance explained by cell type; the proportion of biological gene expression variance explained by cluster; the proportion of biological gene expression variance explained by spatial position, us- ing the grouping X = a spatial spatial bin (non-overlapping 2 ∗ 2 spot squares); the proportion of biological gene expression variance explained by spatial position and cell type where X = the combination of spatial bin and cell type. Figure 3.4B applies this formula to estimate the pro- portion of the residual biological gene expression variance, after accounting for cell type, that is explained by spatial position. In these models, the grouping X= (cell type, spot id), and permuted X = (cell type, permuted spot id). By permuting the spot id but not the cell type annotation, we ensure that our estimate is of the variance explained by the cell type + space model relative to the cell-type-only model, rather than relative to a null model. To estimate the variance that can be explained by a null (shuffled) model, we perform this permutation procedure 50 times and take the average of these trials. 3.6.20 Pairwise angular distance Only cell types with 100 cells or more, originating from a single slide were considered for this analysis. Each cell was size factor normalized, and scaled to the unit hypersphere. The pairwise angular distance was then calculated as detailed above. To test whether there was a relationship between angular distance and physical distance, a linear model was fit with angular distance as 109 the response and physical distance as the sole predictor (angular.distance distance). Reported p-values indicate the significance of coefficient for the physical distance predictor variable using the Wald linear regression test. 3.6.21 Spatial gene modules Gene module analysis in Figure 3.48 comparing cell type derived gene modules and spatial gene modules was performed on Slide 14. Briefly, we performed PCA, UMAP dimensionality reduc- tion and clustering on nuclei mapping to Slide 14. Genes autocorrelated in the UMAP embedding were then calculated using Monocle3’s graph test() function. Similarly, genes autocorrelated in spatial-position were calculated using the sci-Space derived spatial coordinates as the input into Monocle3’s graph test() function. We performed module analysis on the union of genes with sig- nificant autocorrelation in the UMAP embedding and spatial position (FDR < 0.05). UMAP gene modules were recovered using Monocle3’s find gene modules() function on the cell*gene matrix. Spatial gene modules were recovered using Monocle3’s find gene modules() on the position*gene matrix. Gene expression from each set of discovered modules was then aggregated using Monocle 3’s aggregate gene expression() function and visualized as row- and column-clustered heatmaps. 3.6.22 Neuronal trajectory analysis Neuronal trajectories mapping to the brain and classified as either Radial Glia, Neurons or Glial Cells were subset from the neural lineage partition. PCA, followed by UMAP dimensionality re- duction was performed on this subset. Monocle3 was then used to learn a principle graph on the UMAP embedding. For pseudotime inference, the root of the trajectory was selected by manu- ally identifying principal graph nodes occupied by radial glia. Each trajectory was then scaled for display purposes. These nuclei were then mapped to brain regions based on segmentation per- formed using the Allen Institute’s Anatomical Reference Brain Atlas (www.atlas.brain-map.org). To calculate pseudotime dependent genes, nuclei mapping to the Pallium, Sub Pallium and Mid- brain were first subsetted. For each brain region and the trajectory contained within it, pseudotime 110 dependent genes were then recovered by fitting a natural spline with three degrees of freedom using pseudotime as the predictor. Genes with at least one significant knot (FDR < 0.01) were deemed pseudotime dependent. This process was repeated for all three trajectories. The results of these tests are provided in File S3. The intersection of significant genes across all three tra- jectories is shown in Figure 3.5F . For gene ontology analysis, the dendrogram producing the row-clustering in Figure 3.5F was cut to produce 4 groups. Each group was then provided as input into http://geneontology.org. Displayed gene ontology terms were chosen from a list of significant terms (FDR < 0.01). 3.7 SUPPLEMENTAL FIGURES Figure 3.6: Comparison of methods for spatial transcriptomics. Schematics and key attributes of the two major classes of contemporary spatial transcriptomics methods, as well as sci-Space. (A) Spatial transcrip- tome capture (STC) methods, e.g. the original “spatial transcriptomics” method (93) and Slide-seq (94). (B) Methods relying on multiplexed mRNA imaging, e.g. MERFISH (95), seqFISH (96), and FISSEQ (97). (C) sci-Space. 111 Figure 3.7: Comparison of methods for spatial transcriptomics. (A) Schematic of spotted oligos with SYBR green fluorescent dye marked positions labeled in green. All positions contain a combination of location-informative “hashing” sequences (single stranded DNA) with polyA tails. (B) Overlay of bright- field and fluorescence image of the same position. (C) Average radius and spot-to-spot distance computed from imaged slides. (D) Diagram of hierarchical barcoding approach where each position is marked by a unique combination of one of 16 sector barcodes (colors) and one of 1536 spot sequences. (E) An example displaying a single spot oligo barcode (white square) which is in 5 different sectors. Scale bar (B) = 0.1 mm. 112 Figure 3.8: SYBR green waypoints transfer to DAPI stained embryo. (A) Permeabilized mouse embryo section receives (B) SYBR green waypoints spotted at a single section. (C) The resulting transfer and imaging shows the location of each waypoint on the DAPI stained section. Dashed white lines denote the approximate location of each sector. Scale bar in panel (C) = 0.5 mm. 113 Figure 3.9: Labeling of cryosectioned tissues with hash oligos from an agarose coated slide is com- patible with sci-RNA-seq. Slides from sections of the developing mouse embryo were first labeled with a slide specific oligo and then labeled with another hash oligo from a space-grid containing a single hash oligo at varying concentrations. Replicates are independent experiments performed on different days using tissue sections from a single batch. (A) RNA UMIs recovered per cell across stages and replicates. (B) Number of cells sampled from each slide across stages and replicates. (C-E) UMAP embedding colored by (C) embryonic stage, (D) replicate or (E) expression of skeletal muscle marker Titin (Ttn). (F) Correlation of RNA UMIs recovered per gene between replicates at different stages. (G) Hash UMIs recovered per cell of oligo spotted at 10µM, 20µM, 25µM and 50µM concentrations. 114 Figure 3.10: Spotted space-grids are reproducible. (A) HEK293T nuclei were exposed to hash-oligos dissolved from one of 3 space-grids. (B) Correlation between spot oligo counts originating from different slides. (C) Distribution of sector oligos observed per cell, broken out by replicate. (D) Cartoon depicting control experiment with human and mouse cell lines grown on a glass slide and barcoded with a single hash-oligo. (E) Scatter plot depicting the number of human (X axis) or mouse (Y axis) unique molecular identifiers (UMIs) detected per cell. Nuclei were filtered for those mapping to a single slide. (F) The percentage of contaminating UMIs for each cell type. 115 Figure 3.11: Spotted space-grids are reproducible. (A) Spatially indexed slides, “space-grids,” were fabricated by spotting unique combinations of hashing oligos onto agarose membrane-coated slides. (B) Permeabilized fresh-frozen tissue sections (C) received the spatially-defined pattern of oligos by diffusion from the space-grids when the oligo-laden agarose and tissue section were sandwiched together between their carrier slides. 116 Figure 3.12: sci-Space workflow for sequencing library preparation and demultiplexing transcripts allows transcripts and spatial positions to be assigned to individual nuclei. (A) Hashing oligos or bar- codes are transferred to nuclei as determined by nuclei positions relative to the barcode array. (B) Nuclei from each slide are dissociated and labelled with an additional slide-specific barcode. (C) Transcripts and barcodes are tagged with nuclei-specific indices according to the sci-RNA-seq protocol (120). Note that the green segment shown in the schematic includes combinatorial indexing barcodes introduced during sci- RNA-seq. (D) Barcodes and transcripts from all nuclei are pooled and sequenced. (E) Indices are used to demultiplex transcripts and barcodes, which allow for the assignment of each nucleus to its slide, sector, and spot of origin. 117 Figure 3.13: Cellular hashing distinguishes low RNA UMI nuclei from aggregates and uniquely marks a nucleus’ position. (A) Cumulative distribution of RNA UMIs (unique molecular identifiers) identified after sequencing. Red line corresponds to a lenient cutoff used for initial calls. (B) Violin density plots displaying the distribution of cells which were labeled by single slide-specific hash oligo. Cells which failed to show enrichment of a single slide-specific hash oligo were filtered out. Red line corresponds to the same cutoff displayed in panel (A). (C-F) Heatmaps displaying the average hash oligo counts from the top spot and surrounding positions for 1000 randomly sampled cells from Slide 1 (C), Slide 4 (D), Slide 13 (E) and Slide 14 (F). Histograms reflect the marginal sums for rows or columns. 118 Figure 3.14: sci-Space sequenced cells have complex transcriptomes and separate into the major cell types. (A) Boxplots displaying the number of unique molecular identifiers recovered (red) and genes de- tected (blue) for cells from each slide. (B-C) Louvain clustering result and the corresponding proportions of cells found in each cluster. (D-E) UMAP embedding colored by the slide of origin (D) or embryo of origin (E). 119 Figure 3.15: Comparison of recovery of nuclei from sequencing and estimated nuclei present upon imaging. (A) Estimates of DAPI stained nuclei present per spot. Per spot estimates for Slide 3 are shown as an example. (B) Comparison of the total number of estimated nuclei present versus the number of nuclei recovered from each slide. Method for seeding cells during sci-RNA-seq is noted below the x-axis. An estimate of nuclei count from slides 5 and 12 could not be computed due to corruption or loss of the image file. 120 Figure 3.16: Automated cell type annotation and concordance between methods. (A) Estimates of DAPI stained nuclei present per spot. Per spot estimates for Slide 3 are shown as an example. (B) Comparison of the total number of estimated nuclei present versus the number of nuclei recovered from each slide. Method for seeding cells during sci-RNA-seq is noted below the x-axis. An estimate of nuclei count from slides 5 and 12 could not be computed due to corruption or loss of the image file. 121 Figure 3.17: DAPI stained images of sci-space sequenced slides.The slide number and the embryo from which they originate are displayed above each image. The image for Slide 12 was lost during data transfer. SYBR green point layouts varied between space-grid prints, and positioning of the sections relative to the oligo array also varied. Scale bars = 0.5 mm. 122 Figure 3.18: Co-registration procedure of imaged section and space-grid. (A) DAPI stained E14.0 section (Slide 3) with SYBR green points imaged in the GFP-channel. Matched SYBR green waypoints between the image and (B) the intended SYBR pattern on an ideal space-grid are used to calculate an affine- transformation. (C) Co-registered imaging data with inferred positions overlayed with image with inset highlighted (D). Scale bars in panels (A) and (C) = 1 mm. 123 Figure 3.19: Spatial distribution of nuclei per slide in addition to key spatial metrics.Histograms dis- playing the number of nuclei recovered per position for each slide are shown next to a heatmap of the cells recovered per spatial position. In text, above and beside each histogram, summary spatial statistics are noted for each slide. An asterisk is displayed next to slides that were processed using the optimized sci-Space protocol with an extra sonication step and nucleus seeding via dilution (as opposed to FACS). 124 Figure 3.20: Segmented organs from each embryo section. Segmented regions are highlighted on each section and colored according to the denoted anatomical structure. 500um scale bar shown in the bottom right corner of each image. 125 Figure 3.21: Alignment of immunostaining from adjacent cryosections. (A-D) Serial cryosections adja- cent to the sequenced embryo tissue sections immunostained for sarcomeric alpha-actinin (green), cardiac alpha-myosin heavy chain (red), and E-cadherin (magenta) with a Hoechst (blue) counterstain. (E-H) Affine transform alignment allowed for overlay of the staining on the image of the sequenced and for identification of tissue structures, as seen in (I,J), the cardiac region insets from panels (D,H) respectively. Scale bars: (A-H) = 1 mm, (I, J) = 0.5 mm. 126 Figure 3.22: Spatial position of cell types for Slide 1 from embryo . Sequenced cells, broken out by cell type, are placed onto the imaged embryo at the spatial position to which they map on top. Top 16 abundant cell types are shown with a 500um scale bar (bottom right). 127 Figure 3.23: Spatial position of cell types for Slide 4 from embryo 1. Sequenced cells, broken out by cell type, are placed onto the imaged embryo at the spatial position to which they map on top. Top 16 abundant cell types are shown with a 500um scale bar (bottom right). 128 Figure 3.24: Spatial position of cell types for Slide 5 from embryo 1. Sequenced cells, broken out by cell type, are placed onto the imaged embryo at the spatial position to which they map on top. Top 16 abundant cell types are shown with a 500um scale bar (bottom right). 129 Figure 3.25: Spatial position of cell types for Slide 6 from embryo 1. Sequenced cells, broken out by cell type, are placed onto the imaged embryo at the spatial position to which they map on top. Top 16 abundant cell types are shown with a 500um scale bar (bottom right). 130 Figure 3.26: Spatial position of cell types for Slide 7 from embryo 1. Sequenced cells, broken out by cell type, are placed onto the imaged embryo at the spatial position to which they map on top. Top 16 abundant cell types are shown with a 500um scale bar (bottom right). 131 Figure 3.27: Spatial position of cell types for Slide 8 from embryo 1. Sequenced cells, broken out by cell type, are placed onto the imaged embryo at the spatial position to which they map on top. Top 16 abundant cell types are shown with a 500um scale bar (bottom right). 132 Figure 3.28: Spatial position of cell types for Slide 8 from embryo 1. Sequenced cells, broken out by cell type, are placed onto the imaged embryo at the spatial position to which they map on top. Top 16 abundant cell types are shown with a 500um scale bar (bottom right). 133 Figure 3.29: Spatial position of cell types for Slide 8 from embryo 1. Sequenced cells, broken out by cell type, are placed onto the imaged embryo at the spatial position to which they map on top. Top 16 abundant cell types are shown with a 500um scale bar (bottom right). 134 Figure 3.30: Spatial position of cell types for Slide 8 from embryo 1. Sequenced cells, broken out by cell type, are placed onto the imaged embryo at the spatial position to which they map on top. Top 16 abundant cell types are shown with a 500um scale bar (bottom right). 135 Figure 3.31: Spatial position of cell types for Slide 8 from embryo 1. Sequenced cells, broken out by cell type, are placed onto the imaged embryo at the spatial position to which they map on top. Top 16 abundant cell types are shown with a 500um scale bar (bottom right). 136 Figure 3.32: Spatial position of cell types for Slide 8 from embryo 1. Sequenced cells, broken out by cell type, are placed onto the imaged embryo at the spatial position to which they map on top. Top 16 abundant cell types are shown with a 500um scale bar (bottom right). 137 Figure 3.33: Spatial position of cell types for Slide 8 from embryo 1. Sequenced cells, broken out by cell type, are placed onto the imaged embryo at the spatial position to which they map on top. Top 16 abundant cell types are shown with a 500um scale bar (bottom right). 138 Figure 3.34: Spatial position of cell types for Slide 8 from embryo 1. Sequenced cells, broken out by cell type, are placed onto the imaged embryo at the spatial position to which they map on top. Top 16 abundant cell types are shown with a 500um scale bar (bottom right). 139 Figure 3.35: Cell types enriched in annotated anatomical segments. Heatmap displaying the percentage of cells of a given type mapping to an anatomical annotation. Unannotated denotes regions of the embryo which were not assigned to an anatomical annotation. (OPCs - Oligodendrocyte Progenitor Cells; CTPs - Connective Tissue Progenitors). 140 Figure 3.36: Tissue domains based on similar cell type compositions are found across the embryo. (A) UMAP dimensionality reduction displaying the tissue domains, identified using the hidden Markov random field model from the Giotto (108) package, from each slide based on similar cell type composition in spatial locations. Each point is a single domain from a single slide and is colored according to cluster assignment. (B) Cells are colored based on membership to a tissue domain in four representative slides. (C) Heatmap displaying the percentage of a cell-type that composes each tissue domain. (D) The top five cell types that compose each tissue domain are shown with absolute counts. 141 Figure 3.37: Comparison of traditional ISH and digital in-situs. Expression for a given gene (rows) plot- ted for slides from embryo 1 (slides 11,1,14 and 8) (columns) compared to published in-situ hybridization data (109) on matched sections. Genes with high spatial autocorrelation in different tissues were chosen for display. 500um scale bar shown at the bottom right of sci-Space data. 142 Figure 3.38: Comparison spatial single cell data and aggregated spatial data. (A,B), UMAP dimension- ality reduction of cells aggregated by spatial positions from all slides. Aggregated spatial positions which contain cells originally mapping to a single UMAP cluster are colored red in panel (A), and spatial posi- tions consisting of a single cell type label are colored by their corresponding label in panel (B). Positions containing cells with membership with multiple clusters or cell types are colored in grey. (C) UMAP di- mensionality reduction of single cells sampled from each spatial position, colored by cell type label. (D,E) Seurat v3 aggregated spatial positions from Slide 14 only. (D) displayed in UMAP space and colored by cluster or (E) displayed in their spatial positions on the embryo. (F) Distance between the Seurat v3 highest probability mapping of a single cell onto the aggregated spatial positions and a cell’s recovered position by sci-Space. Distance calculated between positions on the spatial grid. (G,H) Comparison between sci-Space data (v1 refers to slides 1-6 and v1.1 refers to slides 7-14), 10x Visium (mouse cortex) data and Slide-seq-v2 (hippocampus) data displaying UMIs/mm2 in panel (G) or genes detected per spatial feature in panel (H). (I) The current distribution of UMIs/cell of sci-Space data (4,831 reads/cell) versus deeper sequencing (43,536 reads/cell) performed on a small scale sci-Space experiment. 143 Figure 3.39: Dissection labels transferred from a developing mouse brain atlas to cells in the sci-Space dataset. The sci-Space dataset was co-embedded with the developing mouse brain atlas dataset from La Manno et al (103). The anatomical dissection label was then transferred from the developing mouse brain atlas dataset to the sci-Space dataset. This was done using the majority label of each cell’s five nearest neighbors in the co-embedded space. Each facet shows a different slide with a contour of the brain shown to the right of each facet. 144 Figure 3.40: Transfer of spatial labels from a developing mouse brain atlas to the sci-Space dataset – Slide 14. The sci-Space dataset was co-embedded with the developing mouse brain atlas dataset from La Manno et al. (103). The highly resolved UMAP cluster label was then transferred from the developing mouse brain atlas dataset to the sci-Space dataset. This was done using the majority label of each cell’s five nearest neighbors in the co-embedded space. Each facet displays all sci-Space cells (in grey) with highlighted cells (in red) bearing the transferred UMAP cluster. Only clusters with greater than 5 cells are shown. 145 Figure 3.41: Transfer of spatial labels from a developing mouse brain atlas to the sci-Space dataset – Slide 13. The sci-Space dataset was co-embedded with the developing mouse brain atlas dataset from La Manno et al. (103). The highly resolved UMAP cluster label was then transferred from the developing mouse brain atlas dataset to the sci-Space dataset. This was done using the majority label of each cell’s five nearest neighbors in the co-embedded space. Each facet displays all sci-Space cells (in grey) with highlighted cells (in red) bearing the transferred UMAP cluster. Only clusters with greater than 5 cells are shown. 146 Figure 3.42: Connective tissue progenitors are composed of multiple subtypes with distinct spatial distributions. (A) Sub-clustering of the differentiating mesenchyme cells with cells colored by cluster. (B) Position of 3 selected sub-clusters that show differing spatial restriction within the embryos, with subcluster 1 primarily in the limbs, subcluster 7 focused in the face, and subcluster 8 more dispersed. 147 Figure 3.43: Connective tissue progenitors are composed of multiple subtypes with distinct spatial distributions. Subtypes of the neurons from Slide 1 annotated using developing mouse brain and spinal cord atlases (103, 124) plotted (A) by UMAP and (B) spatially. (C) log10-scale boxplot of the UMAP Moran’s I statistic for Hox genes displayed in (Figure 3.3B) versus all other expression level-matched genes (p-value < 0.01, two sided t-test). (D) UMAP embedded expression patterns for HoxA genes. (E) UMAP embedding of neurons from the E13.5 stage of mouse development from a published single cell spinal cord dataset (124). Colors and labels mark neuron subtypes annotated by the authors. (F) Expression of the same HoxA cluster displayed in panels (A-B). 148 Figure 3.44: Connective tissue progenitors are composed of multiple subtypes with distinct spatial dis- tributions. (A) Spatial patterns of Cyp26b1 and Hoxa10 gene expression for neurons from Slide 1 and (B) corresponding expression patterns across neuronal cell states/subtypes (UMAP embedded). (C) Cyp26b1 and marker gene coexpression patterns measured by sci-Space (top row) were validated by comparison with co-expression detected with RNA FISH (bottom row). Cell types were designated using the following marker genes: Cldn5 - endothelial cells, Gad2 - GABAergic (inhibitory) neurons, Lum - vascular and lep- tomeningeal cells (VLMCs), Pax6 - radial glia, Slc17a6 and Slc17a7 - Glutamatergic (excitatory) neurons. RNA FISH supported the spatial mapping of sci-Space transcriptomes expressing Cyp26b1 and Hoxa10 and the assayed marker genes. (D) Expression patterns (UMAP) for a subset of spatially but not cell state re- stricted gene across neuronal cell states (Slide 1). (E) Analysis of Slide 14 shows consistent spatial versus UMAP log10(Morans I) values as compared to Slide 1 (Figure 3.3E). 149 Figure 3.45: Similarity between pairs of transcriptomes as a function of spatial distance. (A,B) Pairwise angular distance between cells broken out by the distance in millimeters between cells mapping back to the spatial grid. Boxplots display all cell types with over 100 cells in (A) Slide 1 and (B) Slide 14. Stars denote significant distance coefficient in linear regression without the application of an estimate filtering step (* : p-value < 0.01, ** : p-value < 0.001, *** : p-value < 0.0001, Wald linear regression test; Methods). 150 Figure 3.46: Re-examining the effect of lineage on transcriptome in the developing C. elegans dataset using angular distance. (A) Pairwise angular distance between cells in the C. elegans AB lineage (85). Boxplots denote lineage relationship between the pair of cells. (B) Relationship between Jenson-Shannon distance and angular distance calculated between all pairs of cells displayed in panel (A). (C) Empirical confirmation of the Law of Total Variance. Red bars indicate the sum of unexplained and explained variance after grouping cells that shared a common parent. Blue bar denotes the average mean squared error from the global mean. (D) Variance explained for cells in the AB lineage after accounting for parental identity. Variance explained statistic was computed as described in Methods. 151 Figure 3.47: Variance explained by the cell type label, spatial label, or both. Proportion of variance explained, apart from that attributable to sparse UMI sampling, as described in Methods. Each spatial bin corresponds to 4 adjacent spots that are collapsed. Spatial bins that only contained a single cell of a given cell type were excluded from the analysis. 152 Figure 3.48: Variance explained by the cell type label, spatial label, or both. (A)Gene modules were re- covered by using either the aggregated spot by gene count matrix (Spatial) or the cell-by-gene count matrix (UMAP). Panel A shows the distribution of genes per module and the adjusted rand index calculated on the two groupings. (B,C) Row- and column-clustered heatmap of aggregated (B) UMAP-derived module ex- pression (rows) per UMAP cluster (columns) or (C) spatially-derived module expression (rows) per UMAP cluster (columns). Color bar on top corresponds to the majority cell type within the UMAP cluster. (D) Row- and column-clustered heatmap of aggregated, spatially-derived module expression (rows) per spatial position (columns). Color bar on top corresponds to the majority cell type within each spatial position. 153 Figure 3.49: Subtypes of chondrocytes are spatially restricted. (A) Sub-clustered chondrocytes colored by cluster. Colored arrowheads highlight accumulations of cluster 6 (blue) and cluster 7 (purple). (B) Log2- fold change of genes expressed in cluster 3 (left) and cluster 6 (right) relative to a background distribution of sampled chondrocyte cells. 154 Figure 3.50: Characteristics of neural pseudotemporal trajectory. (A-D) UMAP embedding of neural cells in the trajectory colored by (A) aggregate expression of G1/S markers, (B) aggregate expression of G2/M markers, (C) selected gene expression with documented roles in neuronal migration, or (D) age of transferred nearest neighbors in the developing brain atlas dataset (103). (E) Scatter plot displaying trans- ferred age and pseudotime for each of the three trajectories with a least squares regression fit (blue). 155 Figure 3.51: Spatial location of cortical neural trajectories. Neurons and radial glia identified in the neural trajectories are shown colored from navy blue (early in pseudotime) to yellow (late in pseudotime), while neurons and radial glia not found in the trajectory are colored grey. 156 ‘ 157 Chapter 4: CONCLUSION 4.1 NEW MODELS FOR A NEW ERA The sequencing of the human genome led to a sea change in the way that we perform science. Another sea change, which has yet to realize its potential, is the training of highly flexible compu- tational models for generalized problems in biology. Epitomized by the training of the AlphaFold algorithm(125), computational models are beginning to perform complex tasks with startling ac- curacy. In the case of AlphaFold, the development of new model architectures, along with the shedding of old heuristics, has led to the development of an astonishingly accurate algorithm for predicting protein structure. Although Alphafold currently performs the task of protein structure prediction, there is a clear opportunity for such an algorithm to perform de novo protein design – the task of designing a sequence given a target structure. In another avenue, the success of the lan- guage model GPT3, is predicated on learning 175 billion model parameters by training on massive amounts of diverse training data. The result is an algorithm that can autocomplete text, answer questions, and generate language de novo capable of passing the Turing test. In human genetics efforts to achieve a similar scale are currently in progress. To date, the Ex- ome Aggregation Consortium (ExAC) database and the Genome Aggregregation Database (Gno- mAD) (126) have amassed a total of 125,748 exome sequences and 15,708 whole genome se- quences. Importantly, this data is harmonized and filtered for high quality sequencing data allowing for its use in future efforts. Furthemore, the incorporation of metadata pertinent to the sequenced individual(127) has led to the development of data driven models that link disease observed in the human population to the underlying genotypic variants. Comparable efforts are currently underway to sequence diverse human populations which will only add to the value of this resource(128). 158 4.2 HARMONIZING SINGLE CELL GENOMICS Analogous to the commercial activity that led to the proliferation of next generation sequencing, the commercialization of single cell genomics platforms has led to their widespread adoption. This means that the amount of high quality, single cell data produced will also increase exponentially as the price of single cell library preparation and sequencing drop. Accordingly, several groups have already invested in the development of software for the integration of different single cell datasets(129, 130). These algorithms operate by matching similar cells found in the two datasets to derive a new projection. The result of this procedure means that any integration between two unique sets of data, is itself unique. To solve this problem there have been attempts to produce a reference atlas of cells. The idea is that when a single cell experiment is performed, cells can be mapped onto this reference, much like the mapping of DNA fragments to the genome. This procedure, which was recently deployed on circulating peripheral blood mononuclear cells(131), was aided by deep characterization of the cells in the reference atlas. In this study, the authors measured surface protein expression in conjunction with the cell’s transcriptome to produce a cellular atlas resolved by both modalities. Notably, measurement of surface marker expression in PBMCs(132) allows for the integration of a deep knowledge base, which has been developed by immunologists over decades. The result is a dataset that not only allows for computational integration, but also integration with scientists in the field of immunology. This synergy will surely help harmonize existing immunology paradigms with the new, high resolution measurements that are being made currently. 4.3 STRUCTURE-FUNCTION The key insights from Hao et. al’s attempt to define a reference is twofold: 1) The field needs a stable reference to facilitate comparisons and consistent annotations, and 2) this reference needs to be readily integratable into existing systems of knowledge. However, the chosen measurement set (surface protein expression and gene expression) may be uniquely suited to PBMCs. Unlike cells in solid tissues and organs, PBMCs do not have a definitive spatial context, thus the omission 159 of spatial context is not an issue. However, for the mapping of cells from non-liquid tissues and organs, the capture of spatial information will be necessary. To address this problem, I believe that the field must start a collective effort to construct a set of single cell transcriptomic atlases for each organism that are both spatially- and lineally- resolved. This set of atlases spanning the stages of development, adolescence, and adulthood, would effectively provide a set of reference “structures” for each organism. In this structural representation each cell would be akin to an atom in an atomic structure and different cell types would stand in for different atom types. The connectivity between cells could either be represented as the spatial connectivity between cells or the lineal relationships between them. The development of such a resource, modeled on the genome projects, could then act as the reference organismal atlas. Cells sequenced at any future time could then get mapped to this ref- erence atlas. Furthermore, this representation of an organism could act as a cipher’s key, unifying sequencing based readouts and imaging based readouts. This includes opportunities for training machine learning models that convert from another imaging modality (such as histology or radio- graphic imaging) to single cell data, and vice-versa. Finally, such a representation would be useful in training models like AlphaFold. Instead of learning the structures of proteins, a model could be trained to learn the connectivities between cells that form tissues and how tissues form organs. Analogous to the fold in protein structure, spatial motifs of cell types could be identified and established at the cellular and molecular levels. Truly fantastic extensions of such a model would enable the de novo design of new tissue architectures, and eventually new organisms. However, to achieve this future, the training data needed to train such a model would need to be diverse – spanning the tree of life and various pathological states. After solving the structure of the organism, the number of training examples needed to train a useful model will require new measurement technologies. These technologies will need to be ap- plicable across many tissue contexts, fast and highly multiplexed. Although the technologies pre- sented in this work begin to achieve these characteristics, there are many refinements still needed to advance this lofty vision. 160 161 BIBLIOGRAPHY (1) Erika Check Hayden. “Technology: The $1,000 genome”. en. In: Nature 507.7492 (Mar. 2014), pp. 294–295. (2) E S Lander et al. “Initial sequencing and analysis of the human genome”. en. In: Nature 409.6822 (Feb. 2001), pp. 860–921. (3) Karen H Miga et al. “Telomere-to-telomere assembly of a complete human X chromo- some”. en. In: Nature 585.7823 (Sept. 2020), pp. 79–84. (4) Glennis A Logsdon et al. “The structure, function and evolution of a complete human chromosome 8”. en. In: Nature 593.7857 (May 2021), pp. 101–107. (5) Mouse Genome Sequencing Consortium and Mouse Genome Sequencing Consortium. Ini- tial sequencing and comparative analysis of the mouse genome. 2002. (6) Chimpanzee Sequencing and Analysis Consortium. “Initial sequence of the chimpanzee genome and comparison with the human genome”. en. In: Nature 437.7055 (Sept. 2005), pp. 69–87. (7) Kerstin Howe et al. “The zebrafish reference genome sequence and its relationship to the human genome”. en. In: Nature 496.7446 (Apr. 2013), pp. 498–503. (8) J C Venter. GENOMICS: Shotgun Sequencing of the Human Genome. 1998. (9) V E Velculescu et al. “Serial analysis of gene expression”. en. In: Science 270.5235 (Oct. 1995), pp. 484–487. (10) Paul Bertone et al. “Global identification of human transcribed sequences with genome tiling arrays”. en. In: Science 306.5705 (Dec. 2004), pp. 2242–2246. (11) Jill Cheng et al. “Transcriptional maps of 10 human chromosomes at 5-nucleotide resolu- tion”. en. In: Science 308.5725 (May 2005), pp. 1149–1154. (12) David S Johnson et al. “Genome-wide mapping of in vivo protein-DNA interactions”. en. In: Science 316.5830 (June 2007), pp. 1497–1502. (13) Gregory E Crawford et al. “Genome-wide mapping of DNase hypersensitive sites using massively parallel signature sequencing (MPSS)”. en. In: Genome Res. 16.1 (Jan. 2006), pp. 123–131. (14) Gregory E Crawford et al. “DNase-chip: a high-resolution method to identify DNase I hypersensitive sites using tiled microarrays”. en. In: Nat. Methods 3.7 (July 2006), pp. 503– 509. (15) Alan P Boyle et al. “High-resolution mapping and characterization of open chromatin across the genome”. en. In: Cell 132.2 (Jan. 2008), pp. 311–322. 162 (16) Jay Shendure et al. “DNA sequencing at 40: past, present and future”. en. In: Nature 550.7676 (Oct. 2017), pp. 345–353. (17) Evan A Boyle, Yang I Li, and Jonathan K Pritchard. “An Expanded View of Complex Traits: From Polygenic to Omnigenic”. en. In: Cell 169.7 (June 2017), pp. 1177–1186. (18) Amit V Khera et al. “Genome-wide polygenic scores for common diseases identify in- dividuals with risk equivalent to monogenic mutations”. en. In: Nat. Genet. 50.9 (Sept. 2018), pp. 1219–1224. (19) Yansheng Liu, Andreas Beyer, and Ruedi Aebersold. “On the Dependency of Cellular Protein Levels on mRNA Abundance”. en. In: Cell 165.3 (Apr. 2016), pp. 535–550. (20) Fuchou Tang et al. “mRNA-Seq whole-transcriptome analysis of a single cell”. en. In: Nat. Methods 6.5 (May 2009), pp. 377–382. (21) Diego Adhemar Jaitin et al. “Massively parallel single-cell RNA-seq for marker-free de- composition of tissues into cell types”. en. In: Science 343.6172 (Feb. 2014), pp. 776–779. (22) Simone Picelli et al. “Smart-seq2 for sensitive full-length transcriptome profiling in single cells”. en. In: Nat. Methods 10.11 (Nov. 2013), pp. 1096–1098. (23) D S Tawfik and A D Griffiths. “Man-made cell-like compartments for molecular evolu- tion”. en. In: Nat. Biotechnol. 16.7 (July 1998), pp. 652–656. (24) Evan Z Macosko et al. “Highly Parallel Genome-wide Expression Profiling of Individual Cells Using Nanoliter Droplets”. en. In: Cell 161.5 (May 2015), pp. 1202–1214. (25) Allon M Klein et al. “Droplet barcoding for single-cell transcriptomics applied to embry- onic stem cells”. en. In: Cell 161.5 (May 2015), pp. 1187–1201. (26) Darren A Cusanovich et al. “Multiplex single cell profiling of chromatin accessibility by combinatorial cellular indexing”. en. In: Science 348.6237 (May 2015), pp. 910–914. (27) Junyue Cao et al. “Comprehensive single-cell transcriptional profiling of a multicellular organism”. en. In: Science 357.6352 (Aug. 2017), pp. 661–667. (28) Ryan M Mulqueen et al. “Highly scalable generation of DNA methylation profiles in single cells”. en. In: Nat. Biotechnol. 36.5 (June 2018), pp. 428–431. (29) Vijay Ramani et al. “Massively multiplex single-cell Hi-C”. en. In: Nat. Methods 14.3 (Mar. 2017), pp. 263–266. (30) Junyue Cao et al. “The single-cell transcriptional landscape of mammalian organogenesis”. en. In: Nature 566.7745 (Feb. 2019), pp. 496–502. (31) Marlon Stoeckius et al. “Cell Hashing with barcoded antibodies enables multiplexing and doublet detection for single cell genomics”. en. In: Genome Biol. 19.1 (Dec. 2018), p. 224. (32) Jase Gehring et al. “Highly multiplexed single-cell RNA-seq by DNA oligonucleotide tag- ging of cellular proteins”. en. In: Nat. Biotechnol. 38.1 (Jan. 2020), pp. 35–38. 163 (33) Christopher S McGinnis et al. “MULTI-seq: sample multiplexing for single-cell RNA se- quencing using lipid-tagged indices”. en. In: Nat. Methods 16.7 (July 2019), pp. 619–626. (34) Dylan R Farnsworth, Lauren M Saunders, and Adam C Miller. “A single-cell transcriptome atlas for zebrafish development”. en. In: Dev. Biol. 459.2 (Mar. 2020), pp. 100–108. (35) Jonathan S Packer et al. A lineage-resolved molecular atlas of C. elegans embryogenesis at single-cell resolution. 2019. (36) Junyue Cao et al. “A human cell atlas of fetal gene expression”. en. In: Science 370.6518 (Nov. 2020). (37) James A Briggs et al. The dynamics of gene expression in vertebrate embryogenesis at single-cell resolution. 2018. (38) Daniel E Wagner and Allon M Klein. “Lineage tracing meets single-cell omics: opportu- nities and challenges”. en. In: Nat. Rev. Genet. 21.7 (July 2020), pp. 410–427. (39) J R Broach and J Thorner. “High-throughput screening for drug discovery”. en. In: Nature 384.6604 Suppl (Nov. 1996), pp. 14–16. (40) D A Pereira and J A Williams. “Origin and evolution of high throughput screening”. In: Br. J. Pharmacol. 152.1 (2007), pp. 53–61. (41) David Shum et al. “A high density assay format for the detection of novel cytotoxic agents in large chemical libraries”. en. In: J. Enzyme Inhib. Med. Chem. 23.6 (Dec. 2008), pp. 931– 945. (42) Channing Yu et al. “High-throughput identification of genotype-specific cancer vulnerabil- ities in mixtures of barcoded tumor cell lines”. en. In: Nat. Biotechnol. 34.4 (Apr. 2016), pp. 419–423. (43) Zachary E Perlman et al. “Multidimensional drug profiling by automated microscopy”. en. In: Science 306.5699 (Nov. 2004), pp. 1194–1198. (44) Yushi Futamura et al. “Morphobase, an encyclopedic cell morphology database, and its use for drug target identification”. en. In: Chem. Biol. 19.12 (Dec. 2012), pp. 1620–1630. (45) Jungseog Kang et al. “Improving drug discovery with high-content phenotypic screens by systematic selection of reporter cell lines”. en. In: Nat. Biotechnol. 34.1 (Jan. 2016), pp. 70–77. (46) Karen L Huss, Pauline E Blonigen, and Robert M Campbell. “Development of a Tran- screener kinase assay for protein kinase A and demonstration of concordance of data with a filter-binding assay format”. en. In: J. Biomol. Screen. 12.4 (June 2007), pp. 578–584. (47) Chaoyang Ye et al. “DRUG-seq for miniaturized high-throughput transcriptome profiling in drug discovery”. en. In: Nat. Commun. 9.1 (Oct. 2018), p. 4307. (48) Erin C Bush et al. “PLATE-Seq for genome-wide regulatory network analysis of high- throughput screens”. In: Nat. Commun. 8.1 (2017). 164 (49) Aravind Subramanian et al. “A Next Generation Connectivity Map: L1000 Platform and the First 1,000,000 Profiles”. en. In: Cell 171.6 (Nov. 2017), 1437–1452.e17. (50) Justin Lamb et al. “The Connectivity Map: using gene-expression signatures to connect small molecules, genes, and disease”. en. In: Science 313.5795 (Sept. 2006), pp. 1929– 1935. (51) Michael B Elowitz et al. “Stochastic gene expression in a single cell”. en. In: Science 297.5584 (Aug. 2002), pp. 1183–1186. (52) Cole Trapnell. “Defining cell types and states with single-cell genomics”. en. In: Genome Res. 25.10 (Oct. 2015), pp. 1491–1498. (53) Sydney M Shaffer et al. “Rare cell variability and drug-induced reprogramming as a mode of cancer drug resistance”. en. In: Nature 546.7658 (June 2017), pp. 431–435. (54) Sabrina L Spencer et al. “Non-genetic origins of cell-to-cell variability in TRAIL-induced apoptosis”. en. In: Nature 459.7245 (May 2009), pp. 428–432. (55) Marlon Stoeckius et al. “Cell Hashing with barcoded antibodies enables multiplexing and doublet detection for single cell genomics”. en. In: Genome Biol. 19.1 (Dec. 2018), p. 224. (56) Jase Gehring et al. Highly Multiplexed Single-Cell RNA-seq for Defining Cell Population and Transcriptional Spaces. 2018. (57) Christopher S McGinnis et al. MULTI-seq: Scalable sample multiplexing for single-cell RNA sequencing using lipid-tagged indices. 2018. (58) Dongju Shin et al. Multiplexed single-cell RNA-seq via transient barcoding for drug screen- ing. 2018. (59) Junyue Cao et al. “The single-cell transcriptional landscape of mammalian organogenesis”. In: Nature (2019). (60) Matthew A McBrian et al. Histone Acetylation Regulates Intracellular pH. 2013. (61) Sarah A Comerford et al. “Acetate dependence of tumors”. en. In: Cell 159.7 (Dec. 2014), pp. 1591–1602. (62) D A Cusanovich et al. “Multiplex single-cell profiling of chromatin accessibility by com- binatorial cellular indexing”. In: Science 348.6237 (2015), pp. 910–914. (63) Junyue Cao et al. “Comprehensive single-cell transcriptional profiling of a multicellular organism”. en. In: Science 357.6352 (Aug. 2017), pp. 661–667. (64) Leland McInnes and John Healy. “UMAP: Uniform Manifold Approximation and Projec- tion for Dimension Reduction”. In: (Feb. 2018). arXiv: 1802.03426 (stat.ML). (65) Marco Jost et al. “Combined CRISPRi/a-Based Chemical Genetic Screens Reveal that Rigosertib Is a Microtubule-Destabilizing Agent”. en. In: Mol. Cell 68.1 (Oct. 2017), 210– 223.e6. 165 (66) G Grosveld et al. “The chronic myelocytic cell line K562 contains a breakpoint in bcr and produces a chimeric bcr/c-abl transcript”. en. In: Mol. Cell. Biol. 6.2 (Feb. 1986), pp. 607– 616. (67) Emileigh K Greuber et al. “Role of ABL family kinases in cancer: from leukaemia to solid tumours”. en. In: Nat. Rev. Cancer 13.8 (Aug. 2013), pp. 559–571. (68) Jordi Barretina et al. “The Cancer Cell Line Encyclopedia enables predictive modelling of anticancer drug sensitivity”. en. In: Nature 483.7391 (Mar. 2012), pp. 603–607. (69) Chengkai Dai et al. “Loss of tumor suppressor NF1 activates HSF1 to promote carcino- genesis”. en. In: J. Clin. Invest. 122.10 (Oct. 2012), pp. 3742–3754. (70) Laleh Haghverdi et al. “Batch effects in single-cell RNA-sequencing data are corrected by matching mutual nearest neighbors”. en. In: Nat. Biotechnol. 36.5 (June 2018), pp. 421– 427. (71) Cole Trapnell et al. “The dynamics and regulators of cell fate decisions are revealed by pseudotemporal ordering of single cells”. en. In: Nat. Biotechnol. 32.4 (Apr. 2014), pp. 381– 386. (72) William Brazelle et al. “Histone deacetylase inhibitors downregulate checkpoint kinase 1 expression to induce cell death in non-small cell lung cancer cells”. en. In: PLoS One 5.12 (Dec. 2010), e14335. (73) Jae-Seok Roe et al. “BET Bromodomain Inhibition Suppresses the Function of Hematopoi- etic Transcription Factors in Acute Myeloid Leukemia”. en. In: Mol. Cell 58.6 (June 2015), pp. 1028–1039. (74) J E Brownell et al. “Tetrahymena histone acetyltransferase A: a homolog to yeast Gcn5p linking histone acetylation to gene activation”. en. In: Cell 84.6 (Mar. 1996), pp. 843–851. (75) J Taunton, C A Hassig, and S L Schreiber. “A mammalian histone deacetylase related to the yeast transcriptional regulator Rpd3p”. en. In: Science 272.5260 (Apr. 1996), pp. 408– 411. (76) Siavash K Kurdistani. “Chromatin: a capacitor of acetate for integrated regulation of gene expression and cell physiology”. en. In: Curr. Opin. Genet. Dev. 26 (June 2014), pp. 53–58. (77) Kathryn E Wellen et al. “ATP-citrate lyase links cellular metabolism to histone acetyla- tion”. en. In: Science 324.5930 (May 2009), pp. 1076–1080. (78) Aaron R Quinlan and Ira M Hall. “BEDTools: a flexible suite of utilities for comparing genomic features”. en. In: Bioinformatics 26.6 (Mar. 2010), pp. 841–842. (79) Christian Ritz et al. “Dose-Response Analysis Using R”. en. In: PLoS One 10.12 (Dec. 2015), e0146021. (80) Matthew E Ritchie et al. “limma powers differential expression analyses for RNA-sequencing and microarray studies”. en. In: Nucleic Acids Res. 43.7 (Apr. 2015), e47. 166 (81) Leif Va¨remo, Jens Nielsen, and Intawat Nookaew. “Enriching the gene set analysis of genome-wide data by incorporating directionality of gene expression and combining sta- tistical hypotheses and methods”. en. In: Nucleic Acids Res. 41.8 (Apr. 2013), pp. 4378– 4391. (82) Aravind Subramanian et al. “Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles”. en. In: Proc. Natl. Acad. Sci. U. S. A. 102.43 (Oct. 2005), pp. 15545–15550. (83) Itay Tirosh et al. “Dissecting the multicellular ecosystem of metastatic melanoma by single- cell RNA-seq”. en. In: Science 352.6282 (Apr. 2016), pp. 189–196. (84) Junyue Cao et al. “The single-cell transcriptional landscape of mammalian organogenesis”. en. In: Nature 566.7745 (Feb. 2019), pp. 496–502. (85) Jonathan S Packer et al. “A lineage-resolved molecular atlas of embryogenesis at single- cell resolution”. en. In: Science 365.6459 (Sept. 2019). (86) James A Briggs et al. “The dynamics of gene expression in vertebrate embryogenesis at single-cell resolution”. en. In: Science 360.6392 (June 2018). (87) Blanca Pijuan-Sala et al. “A single-cell molecular map of mouse gastrulation and early organogenesis”. en. In: Nature 566.7745 (Feb. 2019), pp. 490–495. (88) Jeffrey A Farrell et al. “Single-cell reconstruction of developmental trajectories during zebrafish embryogenesis”. en. In: Science 360.6392 (June 2018). (89) Nikos Karaiskos et al. “The embryo at single-cell transcriptome resolution”. en. In: Science 358.6360 (Oct. 2017), pp. 194–199. (90) Darren A Cusanovich et al. “The cis-regulatory dynamics of embryonic development at single-cell resolution”. en. In: Nature 555.7697 (Mar. 2018), pp. 538–542. (91) Caleb Weinreb et al. “Lineage tracing on transcriptional landscapes links state to fate dur- ing differentiation”. en. In: Science 367.6479 (Feb. 2020). (92) Xiaojie Qiu et al. “Reversed graph embedding resolves complex single-cell trajectories”. en. In: Nat. Methods 14.10 (Oct. 2017), pp. 979–982. (93) Patrik L Sta˚hl et al. “Visualization and analysis of gene expression in tissue sections by spatial transcriptomics”. en. In: Science 353.6294 (July 2016), pp. 78–82. (94) Samuel G Rodriques et al. “Slide-seq: A scalable technology for measuring genome-wide expression at high spatial resolution”. en. In: Science 363.6434 (Mar. 2019), pp. 1463– 1467. (95) Kok Hao Chen et al. “RNA imaging. Spatially resolved, highly multiplexed RNA profiling in single cells”. en. In: Science 348.6233 (Apr. 2015), aaa6090. (96) Sheel Shah et al. “In Situ Transcription Profiling of Single Cells Reveals Spatial Organi- zation of Cells in the Mouse Hippocampus”. en. In: Neuron 92.2 (Oct. 2016), pp. 342– 357. 167 (97) Je Hyuk Lee et al. “Highly multiplexed subcellular RNA sequencing in situ”. en. In: Sci- ence 343.6177 (Mar. 2014), pp. 1360–1363. (98) Sanjay R Srivatsan et al. “Massively multiplex chemical transcriptomics at single-cell res- olution”. en. In: Science 367.6473 (Jan. 2020), pp. 45–51. (99) Mary C Regier et al. “Spatial presentation of biological molecules to cells by localized diffusive transfer”. en. In: Lab Chip 19.12 (June 2019), pp. 2114–2126. (100) Samuel L Wolock, Romain Lopez, and Allon M Klein. “Scrublet: Computational Identi- fication of Cell Doublets in Single-Cell Transcriptomic Data”. en. In: Cell Syst 8.4 (Apr. 2019), 281–291.e9. (101) Laleh Haghverdi et al. “Batch effects in single-cell RNA-sequencing data are corrected by matching mutual nearest neighbors”. en. In: Nat. Biotechnol. 36.5 (June 2018), pp. 421– 427. (102) Tim Stuart et al. “Comprehensive Integration of Single-Cell Data”. en. In: Cell 177.7 (June 2019), 1888–1902.e21. (103) Gioele La Manno et al. “Molecular architecture of the developing mouse brain”. en. July 2020. (104) Hannah A Pliner, Jay Shendure, and Cole Trapnell. “Supervised classification enables rapid annotation of cell atlases”. en. In: Nat. Methods 16.10 (Oct. 2019), pp. 983–986. (105) Tobias Pietzsch et al. “BigDataViewer: visualization and processing for large image data sets”. en. In: Nat. Methods 12.6 (June 2015), pp. 481–483. (106) Johannes Schindelin et al. “Fiji: an open-source platform for biological-image analysis”. en. In: Nat. Methods 9.7 (June 2012), pp. 676–682. (107) P The´venaz, U E Ruttimann, and M Unser. “A pyramid approach to subpixel registration based on intensity”. en. In: IEEE Trans. Image Process. 7.1 (1998), pp. 27–41. (108) Ruben Dries et al. Giotto, a toolbox for integrative analysis and visualization of spatial expression data. (109) Graciana Diez-Roux et al. “A high-resolution anatomical atlas of the transcriptome in the mouse embryo”. en. In: PLoS Biol. 9.1 (Jan. 2011), e1000582. (110) David Fawkner-Corbett et al. “Spatiotemporal analysis of human intestinal development at single-cell resolution”. en. In: Cell (Jan. 2021). (111) Andreas Sagner and James Briscoe. Establishing neuronal diversity in the spinal cord: a time and a place. 2019. (112) V Dupe´ and A Lumsden. “Hindbrain patterning involves graded responses to retinoic acid signalling”. en. In: Development 128.12 (June 2001), pp. 2199–2208. (113) Suzan Abu-Abed et al. “Differential expression of the retinoic acid-metabolizing enzymes CYP26A1 and CYP26B1 during murine organogenesis”. en. In: Mech. Dev. 110.1-2 (Jan. 2002), pp. 173–177. 168 (114) Kirsten M Spoorendonk et al. “Retinoic acid and Cyp26b1 are critical regulators of osteo- genesis in the axial skeleton”. en. In: Development 135.22 (Nov. 2008), pp. 3765–3774. (115) Vladimir Yu Kiselev, Andrew Yiu, and Martin Hemberg. “scmap: projection of single-cell RNA-seq data across data sets”. en. In: Nat. Methods 15.5 (May 2018), pp. 359–362. (116) B Nadarajah et al. “Two modes of radial migration in early development of the cerebral cortex”. en. In: Nat. Neurosci. 4.2 (Feb. 2001), pp. 143–150. (117) O Marın and J L Rubenstein. “A long, remarkable journey: tangential migration in the telencephalon”. en. In: Nat. Rev. Neurosci. 2.11 (Nov. 2001), pp. 780–790. (118) Seong-Seng Tan et al. “Cellular dispersion patterns and phenotypes in the developing mouse superior colliculus”. en. In: Dev. Biol. 241.1 (Jan. 2002), pp. 117–131. (119) Yuji Watanabe, Chie Sakuma, and Hiroyuki Yaginuma. “Dispersing movement of tangen- tial neuronal migration in superficial layers of the developing chick optic tectum”. en. In: Dev. Biol. 437.2 (May 2018), pp. 131–139. (120) Junyue Cao et al. “Comprehensive single-cell transcriptional profiling of a multicellular organism”. en. In: Science 357.6352 (Aug. 2017), pp. 661–667. (121) Darren A Cusanovich et al. “Multiplex single cell profiling of chromatin accessibility by combinatorial cellular indexing”. en. In: Science 348.6237 (May 2015), pp. 910–914. (122) Matthew H Kaufman. The Atlas of Mouse Development. en. Academic Press, Sept. 1992. (123) A E Petiet et al. High-resolution magnetic resonance histology of the embryonic and neonatal mouse: A 4D atlas and morphologic database. 2008. (124) Julien Delile et al. “Single cell transcriptomics reveals spatial and temporal dynamics of gene expression in the developing mouse spinal cord”. en. In: Development 146.12 (Mar. 2019). (125) Andrew W Senior et al. “Improved protein structure prediction using potentials from deep learning”. en. In: Nature 577.7792 (Jan. 2020), pp. 706–710. (126) Konrad J Karczewski et al. “Author Correction: The mutational constraint spectrum quan- tified from variation in 141,456 humans”. en. In: Nature 590.7846 (Feb. 2021), E53. (127) Clare Bycroft et al. “The UK Biobank resource with deep phenotyping and genomic data”. en. In: Nature 562.7726 (Oct. 2018), pp. 203–209. (128) All of Us Research Program Investigators et al. “The “All of Us” Research Program”. en. In: N. Engl. J. Med. 381.7 (Aug. 2019), pp. 668–676. (129) Tim Stuart et al. Comprehensive Integration of Single-Cell Data. 2019. (130) Laleh Haghverdi et al. “Batch effects in single-cell RNA-sequencing data are corrected by matching mutual nearest neighbors”. en. In: Nat. Biotechnol. 36.5 (June 2018), pp. 421– 427. (131) Yuhan Hao et al. “Integrated analysis of multimodal single-cell data”. en. Oct. 2020. 169 (132) Marlon Stoeckius et al. “Simultaneous epitope and transcriptome measurement in single cells”. en. In: Nat. Methods 14.9 (Sept. 2017), pp. 865–868. 170 171 VITA Sanjay R Srivatsan grew up in the suburbs of Atlanta, Georgia. There he attended Evansdale Elementary, Trickum Middle School and Parkview high school. Afterwards, he matriculated to the University of California, Berkeley where he graduated with a B.S. in Bioengineering from the School of Engineering. During the summers of 2010 and 2011, and during the fall semester of 2011, Sanjay spent his time at Emory University performing research under the mentorship of Dr.Periasamy Selvaraj. There he investigated methods for creating immune-stimulating micro- particles derived from tumor cells with the goal of creating a prophylactic cancer vaccine. Before starting graduate school, Sanjay discovered his love for structural biology in the lab of Peter Kwong at the Vaccine Research Center in the National Institute for Allergy and Infectious Disease. There he worked with Tongqing Zhou to solve structures of broadly neutralizing antibodies specific to the HIV-1’s gp120 protein. These structures were in turn used to inform the design new vaccine immunogens to elicit a HIV-1 specific immune response. When not in lab, Sanjay spends his time dreaming of rock climbing, listening to music, dreaming up new experiments, and making pizza.