Data Driven Methods for Scaffolding Genomes with Hi-C

Authors

Sur, Aakash

Abstract

High-quality reference genomes are once again in vogue with the publication of the telomere-to-telomere human genome and several challenging plant and animal genomes. Recent efforts in genome assembly have coalesced around two key technologies: ultra-long reads and genome-wide chromatin conformation capture (Hi-C). Here, we used both to complete the protist genomes of Leishmania donovani, Leishmania tarentolae, Crithidia fasciculata, and Euglena gracilis, shedding light on their genomic organization and evolutionary history. To navigate the many Hi-C genome scaffolding methods, we benchmarked the most popular ones against a set of high-quality reference genomes. We found that while most perform well under ideal circumstances, many struggle with modern high-quality assemblies that contain near-chromosome-length contigs. Finally, we attempted to overcome these limitations with a machine learning approach that leverages the recent bounty of genomes published with Hi-C data. Using an innovative convolutional neural network, we demonstrated a proof of concept for a data-driven approach to scaffolding genomes.

Description

Thesis (Ph.D.)--University of Washington, 2022
