Distributed Task Scheduling on Cloud Infrastructure for Bioinformatics Workflows and Analysis

dc.contributor.advisor: Hung, Ling-Hong
dc.contributor.author: Morrow, Rick
dc.date.accessioned: 2023-04-17T18:01:49Z
dc.date.available: 2023-04-17T18:01:49Z
dc.date.issued: 2023-04-17
dc.date.submitted: 2023
dc.description: Thesis (Master's)--University of Washington, 2023
dc.description.abstract: The datasets analyzed in bioinformatics are large and numerous, requiring complex analysis. The size and complexity of bioinformatics data often make it impractical for researchers to run analytical workflows on their personal laptops or PCs. Bioinformatics jobs, however, can benefit from greatly reduced compute times when parallelized across multiple CPUs or cores. Historically, researchers would run analytical workflows on local or static HPC clusters using batch schedulers like Slurm. Running jobs on HPC clusters can be complicated, as jobs must be created using scheduler-specific scripts. HPC clusters are also typically a shared resource with fixed scalability, and the main purpose of schedulers was to queue requests and provide resources when available. Cloud computing presents a cost-effective, scalable, reproducible and on-demand resource for researchers to extend the computing resources available to them without the overhead of acquiring and maintaining on-site infrastructure. Although the cloud provides elastic and scalable resources, there remains the challenge of effectively utilizing cloud computing resources through efficient task scheduling. Specifically, there needs to be some method of queueing and scheduling tasks, and assigning those tasks to available workers. In this project we build a task scheduler that handles bioinformatics workflows, which are split into atomic tasks that can be run in parallel, distributed across an arbitrary mix of computing resources from local machines to cloud resources of arbitrary size. The tasks themselves will be processed by containerized workers that implement a standardized and reproducible execution environment. We assess the performance of our scheduler using a real-world bioinformatics task which aligns a set of short sequences (reads) to a human reference sequence using the Burrows-Wheeler Aligner (BWA).
We benchmark and compare our scheduling methodology against sequential processing of bioinformatics data using a Dockerized Burrows-Wheeler Aligner (BWA) and against a script that processes this data in parallel without containerization.
dc.embargo.terms: Open Access
dc.format.mimetype: application/pdf
dc.identifier.other: Morrow_washington_0250O_25251.pdf
dc.identifier.uri: http://hdl.handle.net/1773/49832
dc.language.iso: en_US
dc.rights: CC BY
dc.subject: Bioinformatics
dc.subject: Cloud Computing
dc.subject: Parallel Processing
dc.subject: Task Scheduler
dc.subject: Computer science
dc.title: Distributed Task Scheduling on Cloud Infrastructure for Bioinformatics Workflows and Analysis
dc.type: Thesis

Files

Original bundle

Name: Morrow_washington_0250O_25251.pdf
Size: 1.03 MB
Format: Adobe Portable Document Format