Advisor: Krishnamurthy, Arvind
Author: Chen, Lequn
Date: 2024-04-26 (issued 2024)
File: Chen_washington_0250E_26603.pdf
URI: http://hdl.handle.net/1773/51337
Description: Thesis (Ph.D.)--University of Washington, 2024
Format: application/pdf
Language: en-US
Rights: none
Keywords: Inference; Large Language Model; Model Serving; Computer science; Computer science and engineering
Title: Multi-tenant Machine Learning Model Serving Systems on GPU Clusters
Type: Thesis

Abstract:
In an era where GPUs are both costly and scarce, efficiently serving machine learning models has become a critical challenge. If serving one model requires $k$ GPUs, then serving $n$ models independently would seemingly require $kn$ GPUs. In the multi-tenant setting, however, the whole cluster's GPUs can be pooled to serve the $n$ models collectively, requiring far fewer GPUs, since different tenants' load peaks rarely coincide. This dissertation addresses how to optimize cluster-wide GPU utilization in a multi-tenant setting. The key challenges addressed are: (1) batching efficiency under latency constraints, (2) bursty requests and GPU consolidation, and (3) GPU cluster auto-scaling. Two projects address these research problems. The first project, Symphony, focuses on serving DNN models. With a novel Deferred Batch Scheduling algorithm and a system design that supports it, Symphony makes high-quality batching decisions and enables robust auto-scaling. Symphony achieves 6x the goodput given the same number of GPUs, saves 60% of the GPUs when serving the same request rate, and can handle 15 million requests per second. The second project, Punica, creates a new paradigm of serving multiple LoRA fine-tuned large language models at the cost of one; it improves throughput by 12x without sacrificing latency.
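Challenge (1), batching under latency constraints, can be made concrete with a small sketch. This is not Symphony's Deferred Batch Scheduling algorithm (which the dissertation itself defines); it is only a generic illustration of the underlying tradeoff: larger batches raise GPU throughput but take longer to execute, so the scheduler must cap the batch at what still meets the oldest queued request's deadline. All names here (max_feasible_batch_size, latency_model) are hypothetical.

```python
def max_feasible_batch_size(now, oldest_arrival, slo, queued, latency_model):
    """Largest batch of queued requests that still meets the latency SLO.

    latency_model(b) estimates GPU execution time for batch size b and is
    assumed to be increasing in b.  The oldest request has waited longest,
    so its deadline (arrival time + SLO) binds the whole batch.
    """
    deadline = oldest_arrival + slo
    best = 0
    for b in range(1, queued + 1):
        if now + latency_model(b) <= deadline:
            best = b   # a batch of size b still finishes in time
        else:
            break      # latency_model is increasing: no larger b can work
    return best

# Example: 1 ms fixed overhead plus 0.5 ms per batched request.
print(max_feasible_batch_size(
    now=10.0, oldest_arrival=8.0, slo=6.0, queued=16,
    latency_model=lambda b: 1.0 + 0.5 * b))   # -> 6
```

Deferring this decision until a GPU is about to become free gives the scheduler more queued requests to choose from, which is plausibly the intuition behind the "deferred" framing in the abstract.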
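Punica's "many models at the cost of one" claim follows from the structure of LoRA: every fine-tuned variant shares the same base weights and differs only by a low-rank delta. Below is a minimal PyTorch sketch of that idea, assuming per-request adapter pairs (A_i, B_i) of rank r; Punica's actual implementation batches this with custom CUDA kernels, so torch.bmm here is only a stand-in, and all names are hypothetical.

```python
import torch

def batched_lora_forward(x, W, A, B):
    """One linear layer serving a batch where each request uses its own
    LoRA adapter on top of a shared base weight.

    x: (batch, d_in)      activations, one row per request
    W: (d_in, d_out)      base weight, shared by every fine-tuned variant
    A: (batch, d_in, r)   per-request LoRA A matrices (rank r, r << d_in)
    B: (batch, r, d_out)  per-request LoRA B matrices
    """
    # Base model: one large matmul over the whole batch, so the expensive
    # dense computation is paid once regardless of how many variants appear.
    y = x @ W
    # Adapters: each request i adds x_i @ A_i @ B_i.  Rank r is small,
    # so this per-request term is cheap relative to the base matmul.
    delta = torch.bmm(torch.bmm(x.unsqueeze(1), A), B).squeeze(1)
    return y + delta

# Four requests, each hitting a different fine-tuned variant of one base model.
b, d_in, d_out, r = 4, 1024, 1024, 16
x = torch.randn(b, d_in)
W = torch.randn(d_in, d_out)
A = torch.randn(b, d_in, r)
B = torch.randn(b, r, d_out)
y = batched_lora_forward(x, W, A, B)   # shape (4, 1024)
```

Because the dense matmul dominates the cost and is shared, adding another LoRA tenant to the batch adds only the small rank-r term, which is what makes serving $n$ variants roughly as cheap as serving one base model.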