Multi-tenant Machine Learning Model Serving Systems on GPU Clusters
| dc.contributor.advisor | Krishnamurthy, Arvind | |
| dc.contributor.author | Chen, Lequn | |
| dc.date.accessioned | 2024-04-26T23:19:29Z | |
| dc.date.available | 2024-04-26T23:19:29Z | |
| dc.date.issued | 2024-04-26 | |
| dc.date.submitted | 2024 | |
| dc.description | Thesis (Ph.D.)--University of Washington, 2024 | |
| dc.description.abstract | In an era where GPUs are both costly and scarce, efficiently serving machine learning models has become a critical challenge. Assuming that serving one model requires $k$ GPUs, serving $n$ models would seemingly require $kn$ GPUs. In the multi-tenant setting, we can pool the whole cluster's GPUs to serve the $n$ models collectively, thus requiring far fewer GPUs. This dissertation addresses how to optimize cluster-wide GPU utilization in a multi-tenant setting. Key challenges addressed include: (1) batching efficiency under latency constraints, (2) bursty requests and GPU consolidation, and (3) GPU cluster auto-scaling. This dissertation discusses two projects that address the above research problems. The first project, Symphony, focuses on serving DNN models. With a novel Deferred Batch Scheduling algorithm and a system design supporting it, Symphony makes high-quality batching decisions and enables robust auto-scaling. Symphony achieves 6x higher goodput with the same number of GPUs, uses 60% fewer GPUs when serving the same request rate, and can handle 15 million requests per second. The second project, Punica, creates a new paradigm of serving multiple LoRA fine-tuned large language models at the cost of one. Punica improves throughput by 12x without sacrificing latency. | |
| dc.embargo.terms | Open Access | |
| dc.format.mimetype | application/pdf | |
| dc.identifier.other | Chen_washington_0250E_26603.pdf | |
| dc.identifier.uri | http://hdl.handle.net/1773/51337 | |
| dc.language.iso | en_US | |
| dc.rights | none | |
| dc.subject | Inference | |
| dc.subject | Large Language Model | |
| dc.subject | Model Serving | |
| dc.subject | Computer science | |
| dc.subject.other | Computer science and engineering | |
| dc.title | Multi-tenant Machine Learning Model Serving Systems on GPU Clusters | |
| dc.type | Thesis |
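The abstract's deferred-batching idea (accumulate requests into larger batches, but only as long as the oldest request can still meet its latency SLO) can be illustrated with a minimal sketch. This is a hypothetical simplification for intuition only, not Symphony's actual Deferred Batch Scheduling algorithm; all names and parameters here are invented.

```python
def should_dispatch(arrival_times, now, slo, exec_time, wait_step):
    """Illustrative deferred-batching check (hypothetical, not the
    dissertation's algorithm): dispatch the current batch only when
    deferring one more scheduling tick would risk the oldest queued
    request missing its latency SLO."""
    if not arrival_times:
        return False  # nothing queued, nothing to dispatch
    oldest = min(arrival_times)
    deadline = oldest + slo  # time by which the oldest request must finish
    # If we wait one more tick and then run the batch, would we overshoot?
    return now + wait_step + exec_time > deadline

# Example: 100 ms SLO, 20 ms batch execution time, 5 ms scheduling tick.
arrivals = [0.00, 0.01, 0.02]
# At t=0.03 there is still slack, so keep accumulating the batch.
print(should_dispatch(arrivals, now=0.03, slo=0.10, exec_time=0.02, wait_step=0.005))
# At t=0.08 deferring further would miss the deadline, so dispatch now.
print(should_dispatch(arrivals, now=0.08, slo=0.10, exec_time=0.02, wait_step=0.005))
```

Larger batches amortize per-kernel overhead and raise GPU utilization, which is why a scheduler benefits from deferring dispatch right up to the SLO boundary rather than running each request immediately.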
Files
Original bundle (1 of 1)
- Name: Chen_washington_0250E_26603.pdf
- Size: 1.53 MB
- Format: Adobe Portable Document Format