dc.contributor.advisor: Minin, Volodymyr
dc.contributor.advisor: Matsen, Frederick
dc.contributor.author: Dhar, Amrit
dc.date.accessioned: 2019-08-14T22:39:17Z
dc.date.submitted: 2019
dc.identifier.other: Dhar_washington_0250E_20048.pdf
dc.identifier.uri: http://hdl.handle.net/1773/44457
dc.description: Thesis (Ph.D.)--University of Washington, 2019
dc.description.abstract: The adaptive immune system synthesizes antibodies, the soluble form of B cell receptors (BCRs), to bind to and neutralize pathogens that enter the body. B cells generate a diverse set of high-affinity antibodies through the affinity maturation process. During maturation, "naive" BCR sequences first accumulate mutations through a neutral evolutionary process called somatic hypermutation (SHM), which may modify their binding affinities, and are then subject to natural selection by clonal expansion, which promotes the higher-affinity antibodies. The set of mutated BCRs that results from a single naive BCR undergoing SHM is referred to as a "clonal family". In my thesis, I study the mechanisms that govern these evolutionary and selective processes, with the goal of better understanding how naive B cells diversify into mature B cells with high binding affinities. It is frequently important to infer the full evolutionary paths from a given naive BCR sequence to the corresponding mature BCR sequences in the clonal family. Stochastic mapping, a missing data imputation technique, can be used to estimate these mutational trajectories; it is a simulation-based method for probabilistically mapping substitution histories onto phylogenies according to continuous-time Markov models of evolution. Current simulation-free algorithms can compute the mean, but not any higher-order moments, of the number of substitutions or of other stochastic mapping summaries; these algorithms scale linearly in the number of tips of the phylogenetic tree. I present the first simulation-free dynamic programming algorithm that calculates prior and posterior mapping variances and scales linearly in the number of phylogeny tips. This procedure suggests a general framework for efficiently computing higher-order moments of stochastic mapping summaries without simulations.
Before one can perform clonal lineage or ancestral sequence inference in a clonal family, one must first obtain an estimate of the clonal phylogenetic tree. Currently, standard phylogenetic inference techniques are used to model the SHM process; however, these methods do not account for all the complexities of this mutation process. I introduce a novel approach to inference that is based on a phylogenetic hidden Markov model (phylo-HMM). This technique is not only based on a more biologically realistic model of evolution but is also designed to scale to the large datasets that result from high-throughput sequencing. In the antibody engineering field, researchers would like to infer the most likely per-site substitutions that are allowed in a clonal family. Unfortunately, many clonal families are small and do not contain enough observed sequence information to answer this question accurately. Nevertheless, BCR sequences have structural properties that are common across clonal families. I propose a penalized regression model that leverages aggregated amino acid count data (also known as "substitution profiles") in large clonal families to predict the substitution profiles in smaller clonal families. I show that these large clonal families contain information, possibly embedded through structural and functional constraints, that can be shared with the smaller ones to improve their substitution profile predictions. This regularized model assumes independence across sites, which is not a realistic assumption, so I consider extensions to models that account for coevolving sites.
dc.format.mimetype: application/pdf
dc.language.iso: en_US
dc.rights: CC BY
dc.subject: Bayesian statistics
dc.subject: evolution
dc.subject: immunology
dc.subject: phylogenetics
dc.subject: Statistics
dc.subject: Evolution & development
dc.subject: Immunology
dc.subject.other: Statistics
dc.title: Large-Scale B Cell Receptor Sequence Analysis Using Phylogenetics and Machine Learning
dc.type: Thesis
dc.embargo.terms: Delay release for 1 year -- then make Open Access
dc.embargo.lift: 2020-08-13T22:39:17Z

