Multi-tenant Machine Learning Model Serving Systems on GPU Clusters

Authors

Chen, Lequn

Abstract

In an era where GPUs are both costly and scarce, serving machine learning models efficiently has become a critical challenge. If serving one model requires $k$ GPUs, serving $n$ models would seemingly require $kn$ GPUs. In a multi-tenant setting, however, we can pool the whole cluster's GPUs to serve the $n$ models collectively, requiring far fewer GPUs. This dissertation addresses how to optimize cluster-wide GPU utilization in a multi-tenant setting. Key challenges addressed include: (1) batching efficiency under latency constraints, (2) bursty requests and GPU consolidation, and (3) GPU cluster auto-scaling. This dissertation discusses two projects that address these research problems. The first project, Symphony, focuses on serving DNN models. With a novel Deferred Batch Scheduling algorithm and a system design that supports it, Symphony makes high-quality batching decisions and enables robust auto-scaling. Symphony achieves 6x the goodput given the same number of GPUs, saves 60% of GPUs when serving the same request rate, and can handle 15 million requests per second. The second project, Punica, creates a new paradigm of serving multiple LoRA fine-tuned large language models at the cost of one. Punica improves throughput by 12x without sacrificing latency.
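The "many LoRA models at the cost of one" idea can be illustrated with a minimal NumPy sketch. This is not Punica's actual GPU kernel; it only shows the algebra the abstract alludes to: all requests share one large base-model matmul, and each request adds only a small per-adapter low-rank correction. All names and dimensions below are illustrative assumptions.

```python
import numpy as np

# Illustrative sketch (not Punica's real implementation): a batch of
# requests, each targeting a different LoRA fine-tune of one base model.
d, r, n = 16, 4, 3                      # hidden dim, LoRA rank, number of adapters
rng = np.random.default_rng(0)

W = rng.standard_normal((d, d))         # shared base weight, used by every request
A = rng.standard_normal((n, d, r))      # per-adapter LoRA "A" matrices
B = rng.standard_normal((n, r, d))      # per-adapter LoRA "B" matrices

X = rng.standard_normal((5, d))         # batch of 5 request activations
adapter_of = np.array([0, 2, 1, 0, 2])  # which adapter each request uses

# One big batched matmul against the shared base model...
Y = X @ W
# ...plus a cheap rank-r delta per request, gathered by adapter id.
for i, a in enumerate(adapter_of):
    Y[i] += X[i] @ A[a] @ B[a]

# The shared d-by-d matmul dominates the FLOPs, so adding more adapters
# costs only the small rank-r corrections above.
```

Because the base-model computation is batched across tenants, throughput scales with the batch size rather than with the number of distinct fine-tuned models being served.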

Description

Thesis (Ph.D.)--University of Washington, 2024
