Large-Scale B Cell Receptor Sequence Analysis Using Phylogenetics and Machine Learning
The adaptive immune system synthesizes antibodies, the soluble form of B cell receptors (BCRs), to bind to and neutralize pathogens that enter the body. B cells are able to generate a diverse set of high-affinity antibodies through the affinity maturation process. During maturation, "naive" BCR sequences first accumulate mutations according to a neutral evolutionary process called somatic hypermutation (SHM), which may modify the associated binding affinities, and are then subject to natural selection by clonal expansion, which promotes the higher-affinity antibodies. The set of mutated BCRs that result from a single naive BCR undergoing SHM is referred to as a "clonal family". In my thesis, I study the mechanisms that govern these evolutionary and selective processes of BCR sequences, with the goal of better understanding how naive B cells diversify into mature B cells with high binding affinities.
It is frequently important to infer the full evolutionary paths from a given naive BCR sequence to the corresponding mature BCR sequences in the clonal family. Stochastic mapping, a missing-data imputation technique, can be used to estimate these mutational trajectories; it is a simulation-based method for probabilistically mapping substitution histories onto phylogenies according to continuous-time Markov models of evolution. Current simulation-free algorithms can compute the mean, but not any higher-order moments, of the number of substitutions or of other stochastic mapping summaries; these algorithms scale linearly in the number of tips of the phylogenetic tree. I present the first simulation-free dynamic programming algorithm that calculates prior and posterior mapping variances and scales linearly in the number of phylogeny tips. This procedure suggests a general framework that can be used to efficiently compute higher-order moments of stochastic mapping summaries without simulations.
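To make the quantity in question concrete, the following is a minimal sketch of the simulation-based approach that a simulation-free algorithm would replace. It is not the dynamic programming algorithm of the thesis. It simulates a continuous-time Markov chain along a single branch and estimates the prior mean and variance of the number of substitutions by Monte Carlo. The Jukes-Cantor rate matrix, the branch length, and all function names are assumptions chosen for illustration.

```python
import random

def jukes_cantor(mu=1.0):
    """Illustrative 4-state rate matrix: every state is left at total rate mu."""
    return [[-mu if i == j else mu / 3.0 for j in range(4)] for i in range(4)]

def simulate_substitution_count(rate_matrix, start_state, branch_length, rng):
    """Simulate one CTMC path on a branch and return its number of substitutions."""
    state, elapsed, count = start_state, 0.0, 0
    while True:
        exit_rate = -rate_matrix[state][state]
        elapsed += rng.expovariate(exit_rate)  # waiting time until the next jump
        if elapsed >= branch_length:
            return count
        # Pick the next state with probability proportional to its off-diagonal rate.
        targets = [j for j in range(len(rate_matrix)) if j != state]
        weights = [rate_matrix[state][j] for j in targets]
        state = rng.choices(targets, weights=weights)[0]
        count += 1

def prior_moments(rate_matrix, start_state, branch_length, n_sims=20000, seed=0):
    """Monte Carlo estimates of the prior mean and variance of the substitution count."""
    rng = random.Random(seed)
    counts = [
        simulate_substitution_count(rate_matrix, start_state, branch_length, rng)
        for _ in range(n_sims)
    ]
    mean = sum(counts) / n_sims
    variance = sum((c - mean) ** 2 for c in counts) / (n_sims - 1)
    return mean, variance

mean, variance = prior_moments(jukes_cantor(mu=1.0), start_state=0, branch_length=0.5)
print(f"estimated mean: {mean:.3f}, estimated variance: {variance:.3f}")
```

Because every state in this toy model has the same exit rate, the number of substitutions on the branch is Poisson-distributed with both mean and variance equal to `mu * branch_length`, which gives a simple check on the simulation. Posterior moments, which also condition on the observed tip sequences, are not covered by this sketch.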
Before one can perform clonal lineage or ancestral sequence inference in a clonal family, one must first obtain an estimate of the clonal phylogenetic tree. Currently, standard phylogenetic inference techniques are used to model the SHM process; however, these methods do not account for all the complexities associated with this mutation process. I introduce a novel approach to inference that is based on a phylogenetic hidden Markov model (phylo-HMM). This technique is not only based on a more biologically realistic model of evolution but is also designed to scale to the large datasets that result from high-throughput sequencing.
In the antibody engineering field, researchers would like to infer the most likely per-site substitutions that are allowed in a clonal family. Unfortunately, many clonal families are small and do not contain enough observed sequence information to answer this question accurately. Despite this, there are structural properties associated with BCR sequences that are common across clonal families. I propose a penalized regression model that leverages aggregated amino acid count data (also known as "substitution profiles") in large clonal families to predict the substitution profiles in smaller clonal families. I show that there is information, possibly embedded through structural and functional constraints, contained within these large clonal families that can be shared with the smaller ones to enhance their substitution profile predictions. This regularized model assumes independence across sites, which is not a realistic assumption, so I also consider extensions to models that account for coevolving sites.
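The idea of borrowing information from large clonal families can be illustrated with a toy estimator. This is only a sketch and not the penalized regression model of the thesis: it is a ridge-type penalized least-squares estimate for a single site that shrinks the observed amino acid frequencies of a small family toward a pooled profile from large families. The function names, the penalty value, and the example residues are all hypothetical.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def site_profile(residues):
    """Return (frequencies over the 20 amino acids, number of residues counted)."""
    counts = Counter(residues)
    total = sum(counts[a] for a in AMINO_ACIDS)
    if total == 0:
        raise ValueError("no standard amino acids observed at this site")
    return [counts[a] / total for a in AMINO_ACIDS], total

def shrink_profile(small_residues, pooled_profile, penalty):
    """Minimize n * ||p - f||^2 + penalty * ||p - g||^2 over p.

    f is the small family's observed profile (from n residues) and g is the
    pooled profile; the minimizer is p = (n * f + penalty * g) / (n + penalty).
    """
    f, n = site_profile(small_residues)
    return [(n * fi + penalty * gi) / (n + penalty)
            for fi, gi in zip(f, pooled_profile)]

# Hypothetical data for one site: many sequences pooled from large families,
# and only three sequences from a small family.
pooled, _ = site_profile("S" * 60 + "T" * 30 + "A" * 10)
estimate = shrink_profile("SSA", pooled, penalty=3.0)

for amino_acid, prob in zip(AMINO_ACIDS, estimate):
    if prob > 0:
        print(amino_acid, round(prob, 3))
```

In this example the small family never shows threonine at the site, yet the estimate still assigns it some probability because the pooled profile does. Like the model described above, the sketch treats each site independently.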
- Statistics