Distributed Task Scheduling on Cloud Infrastructure for Bioinformatics Workflows and Analysis

Morrow, Rick

Files

Morrow_washington_0250O_25251.pdf (1.03 MB)

Date

2023-04-17

Abstract

Bioinformatics datasets are large and numerous, and they require complex analysis. The size and complexity of these data often make it impractical for researchers to run analytical workflows on personal laptops or PCs. Bioinformatics jobs, however, can run in far less time when parallelized across multiple CPUs or cores. Historically, researchers ran analytical workflows on local or static HPC clusters using batch schedulers such as Slurm. Running jobs on HPC clusters can be complicated, because jobs must be created and submitted using scheduler-specific scripts. HPC clusters are also typically a shared resource with fixed scalability, and the main purpose of their schedulers has been to queue requests and provide resources when available. Cloud computing presents a cost-effective, scalable, reproducible, and on-demand resource that lets researchers extend the computing resources available to them without the overhead of acquiring and maintaining on-site infrastructure. Although the cloud provides elastic and scalable resources, the challenge remains of using those resources effectively through efficient task scheduling. Specifically, there must be some method of queueing and scheduling tasks and of assigning those tasks to available workers. In this project we build a task scheduler that handles bioinformatics workflows, which are split into atomic tasks that can be run in parallel and distributed across an arbitrary mix of computing resources, from local machines to cloud resources of arbitrary size. The tasks themselves are processed by containerized workers that implement a standardized and reproducible execution environment. We assess the performance of our scheduler using a real-world bioinformatics task that aligns a set of short sequences (reads) to a human reference sequence using the Burrows-Wheeler Aligner (BWA). We benchmark our scheduling methodology against sequential processing of bioinformatics data using a Dockerized BWA and against a script that processes this data in parallel without containerization.
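
The queue-and-worker pattern described in the abstract can be illustrated with a minimal Python sketch. It is not the scheduler built in this project: the function names (split_into_tasks, run_scheduler, fake_align) are hypothetical, local threads stand in for containerized workers, and fake_align is a placeholder for a real BWA alignment step.

```python
import queue
import threading


def split_into_tasks(items, chunk_size):
    """Split a workload into independent chunks (the 'atomic tasks')."""
    return [items[i:i + chunk_size] for i in range(0, len(items), chunk_size)]


def run_scheduler(tasks, process, num_workers=4):
    """Queue all tasks, then let workers pull from the queue until it is empty."""
    task_queue = queue.Queue()
    for task_id, task in enumerate(tasks):
        task_queue.put((task_id, task))

    results = {}
    lock = threading.Lock()

    def worker():
        while True:
            try:
                task_id, task = task_queue.get_nowait()
            except queue.Empty:
                return  # nothing left to do for this worker
            outcome = process(task)
            with lock:
                results[task_id] = outcome

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()

    # Reassemble results in the original task order.
    return [results[i] for i in range(len(tasks))]


def fake_align(reads):
    """Hypothetical stand-in for aligning a chunk of reads (no BWA is invoked)."""
    return [read.upper() for read in reads]


reads = ["acgt", "ttga", "ccat", "gga", "tac"]
tasks = split_into_tasks(reads, chunk_size=2)
aligned = run_scheduler(tasks, fake_align, num_workers=3)
```

Because each worker pulls its next task only when it is free, faster workers naturally take on more tasks, which is one reason a shared queue suits a mix of machines of different sizes.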