Multi-tenant Machine Learning Model Serving Systems on GPU Clusters

dc.contributor.advisor: Krishnamurthy, Arvind
dc.contributor.author: Chen, Lequn
dc.date.accessioned: 2024-04-26T23:19:29Z
dc.date.available: 2024-04-26T23:19:29Z
dc.date.issued: 2024-04-26
dc.date.submitted: 2024
dc.description: Thesis (Ph.D.)--University of Washington, 2024
dc.description.abstract: In an era where GPUs are both costly and scarce, efficiently serving machine learning models has become a critical challenge. Assuming that serving one model requires $k$ GPUs, serving $n$ models would seemingly require $kn$ GPUs. In the multi-tenant setting, we can pool the whole cluster's GPUs to serve the $n$ models collectively, thus requiring far fewer GPUs. This dissertation addresses how to optimize cluster-wide GPU utilization in a multi-tenant setting. Key challenges addressed include: (1) batching efficiency under latency constraints, (2) bursty requests and GPU consolidation, and (3) GPU cluster auto-scaling. This dissertation discusses two projects that address the above research problems. The first project, Symphony, focuses on serving DNN models. With a novel Deferred Batch Scheduling algorithm and a system design supporting it, Symphony makes high-quality batching decisions and enables robust auto-scaling. Symphony achieves 6x the goodput given the same number of GPUs, uses 60\% fewer GPUs when serving the same request rate, and is capable of handling 15 million requests per second. The second project, Punica, creates a new paradigm of serving multiple LoRA fine-tuned large language models at the cost of one. Punica improves throughput by 12x without sacrificing latency.
dc.embargo.terms: Open Access
dc.format.mimetype: application/pdf
dc.identifier.other: Chen_washington_0250E_26603.pdf
dc.identifier.uri: http://hdl.handle.net/1773/51337
dc.language.iso: en_US
dc.rights: none
dc.subject: Inference
dc.subject: Large Language Model
dc.subject: Model Serving
dc.subject: Computer science
dc.subject.other: Computer science and engineering
dc.title: Multi-tenant Machine Learning Model Serving Systems on GPU Clusters
dc.type: Thesis

Files

Original bundle

Name: Chen_washington_0250E_26603.pdf
Size: 1.53 MB
Format: Adobe Portable Document Format