Scalable and cloud-enabled analysis of long read sequencing data

Reddy, Shishir

Files

Reddy_washington_0250O_25074.pdf (1.78 MB)

Date

2023-01-21

Abstract

Long-read sequencing has great promise in enabling portable, rapid molecular-assisted diagnoses. Its applications include rapid genetic diagnoses and improved prognosis of critically ill patients through variant detection. A key challenge in democratizing long-read sequencing technology in the biomedical and clinical community is the lack of graphical bioinformatics software tools that can efficiently process the raw data and support graphical output and interactive visualizations for interpreting results. Another obstacle is that high-performance software tools for long-read sequencing data analysis often leverage graphics processing units (GPUs), which are challenging and time-consuming to configure, especially on the cloud. Possible solutions include graphical bioinformatics software tools, hardware acceleration with GPUs, and optimization with Tensor Processing Units (TPUs). Long-read sequencing workflows for diagnosis involve several steps that can be hardware-accelerated and optimized using various processing methods. Optimizing these workflows through hardware acceleration can reduce turnaround times of diagnoses from days to hours. Our goal is to create and optimize long-read sequencing workflows to build rapid, cost-effective solutions for cancer detection and diagnosis on the cloud. This thesis introduces two containerized, hardware-accelerated long-read sequencing analysis workflows, one for fusion analysis and one for variant calling. The fusion analysis workflow introduces a fusion-finding tool, the Biodepot Fusion Finder (BFF), capable of rapidly detecting fusions and calculating sample enrichment. This fusion workflow is benchmarked for accuracy and compared to the fusion-finding software LongGF on nanopore data from cell-line and patient samples.
The variant-calling workflow uses PEPPER-Margin-DeepVariant to call structural variants in a cloud-based, GPU-enabled environment. The GPU and CPU versions of the variant-calling software are benchmarked for accuracy, giving better visibility into which specific stages of the pipeline benefit from hardware acceleration.