Multi-tenant Machine Learning Model Serving Systems on GPU Clusters
| dc.contributor.advisor | Krishnamurthy, Arvind | |
| dc.contributor.author | Chen, Lequn | |
| dc.date.accessioned | 2024-04-26T23:19:29Z | |
| dc.date.available | 2024-04-26T23:19:29Z | |
| dc.date.issued | 2024-04-26 | |
| dc.date.submitted | 2024 | |
| dc.description | Thesis (Ph.D.)--University of Washington, 2024 | |
| dc.description.abstract | In an era where GPUs are both costly and scarce, efficiently serving machine learning models has become a critical challenge. Assuming that serving one model requires $k$ GPUs, serving $n$ models would seemingly require $kn$ GPUs. In the multi-tenant setting, we can pool the whole cluster's GPUs to serve the $n$ models collectively, thus requiring far fewer GPUs. This dissertation addresses how to optimize cluster-wide GPU utilization in a multi-tenant setting. Key challenges addressed include: (1) batching efficiency under latency constraints, (2) bursty requests and GPU consolidation, and (3) GPU cluster auto-scaling. This dissertation discusses two projects that address the above research problems. The first project, Symphony, focuses on serving DNN models. With a novel Deferred Batch Scheduling algorithm and a system design supporting it, Symphony makes high-quality batching decisions and enables robust auto-scaling. Symphony achieves 6x higher goodput with the same number of GPUs, uses 60% fewer GPUs when serving the same request rate, and can handle 15 million requests per second. The second project, Punica, creates a new paradigm of serving multiple LoRA fine-tuned large language models at the cost of one. Punica improves throughput by 12x without sacrificing latency. | |
| dc.embargo.terms | Open Access | |
| dc.format.mimetype | application/pdf | |
| dc.identifier.other | Chen_washington_0250E_26603.pdf | |
| dc.identifier.uri | http://hdl.handle.net/1773/51337 | |
| dc.language.iso | en_US | |
| dc.rights | none | |
| dc.subject | Inference | |
| dc.subject | Large Language Model | |
| dc.subject | Model Serving | |
| dc.subject | Computer science | |
| dc.subject.other | Computer science and engineering | |
| dc.title | Multi-tenant Machine Learning Model Serving Systems on GPU Clusters | |
| dc.type | Thesis |
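The abstract's deferred-batching idea (accumulate requests into larger batches, but only as long as the oldest request can still meet its latency SLO) can be illustrated with a minimal sketch. This is a hypothetical simplification for intuition only, not Symphony's actual Deferred Batch Scheduling algorithm; all names and parameters here are invented.

```python
def should_dispatch(arrival_times, now, slo, exec_time, wait_step):
    """Illustrative deferred-batching check (hypothetical, not the
    dissertation's algorithm): dispatch the current batch only when
    deferring one more scheduling tick would risk the oldest queued
    request missing its latency SLO."""
    if not arrival_times:
        return False  # nothing queued, nothing to dispatch
    oldest = min(arrival_times)
    deadline = oldest + slo  # time by which the oldest request must finish
    # If we wait one more tick and then run the batch, would we overshoot?
    return now + wait_step + exec_time > deadline

# Example: 100 ms SLO, 20 ms batch execution time, 5 ms scheduling tick.
arrivals = [0.00, 0.01, 0.02]
# At t=0.03 there is still slack, so keep accumulating the batch.
print(should_dispatch(arrivals, now=0.03, slo=0.10, exec_time=0.02, wait_step=0.005))
# At t=0.08 deferring further would miss the deadline, so dispatch now.
print(should_dispatch(arrivals, now=0.08, slo=0.10, exec_time=0.02, wait_step=0.005))
```

Larger batches amortize per-kernel overhead and raise GPU utilization, which is why a scheduler benefits from deferring dispatch right up to the SLO boundary rather than running each request immediately.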
Files
Original bundle (1 of 1)
- Name: Chen_washington_0250E_26603.pdf
- Size: 1.53 MB
- Format: Adobe Portable Document Format