Structural Insights for LLM Serving Efficiency
| dc.contributor.advisor | Krishnamurthy, Arvind | |
| dc.contributor.author | Patel, Pratyush | |
| dc.date.accessioned | 2025-10-02T16:07:28Z | |
| dc.date.issued | 2025-10-02 | |
| dc.date.submitted | 2025 | |
| dc.description | Thesis (Ph.D.)--University of Washington, 2025 | |
| dc.description.abstract | The widespread adoption of Large Language Models (LLMs) has reshaped the datacenter computing landscape. As these models continue to grow in size and complexity, they require increasingly expensive and power-intensive infrastructure. Hence, serving LLMs efficiently has become critical for managing costs and resource constraints in modern datacenters. In this dissertation, I argue that serving efficiency can be significantly improved by designing systems that are aware of the distinct phases of generative LLM inference: a compute-intensive prefill phase and a memory-intensive decode phase. Exploiting the unique properties of these phases unlocks substantial performance gains at scale. My research validates this thesis through three studies. First, I address power constraints, a key bottleneck to datacenter growth. By analyzing how the distinct power demands of the prefill and decode phases aggregate, I show that inference cluster power is underutilized. Based on this observation, I develop a power oversubscription framework that safely adds more servers under existing power budgets, increasing inference cluster capacity with minimal performance impact. Second, I show that running the compute-bound prefill and memory-bound decode phases on the same hardware leads to poor performance and resource stranding. To address these overheads, I introduce a new inference cluster architecture that disaggregates the phases onto separate hardware fleets specialized for each phase's resource needs. This phase-separated cluster design yields substantial efficiency improvements over traditional approaches. Third, I extensively analyze the unique inefficiencies caused by conditional computation in Mixture-of-Experts (MoE) models, which I formalize as the MoE tax. This tax manifests differently across the two phases, for instance, creating load imbalance in prefill and increasing memory transfers in decode. Based on this analysis, I propose phase-specific optimizations to address these bottlenecks and improve the efficiency of serving MoE models at scale. Collectively, these studies demonstrate that phase awareness is a key principle for designing efficient generative LLM serving systems. | |
| dc.embargo.lift | 2027-09-22T16:07:28Z | |
| dc.embargo.terms | Restrict to UW for 2 years -- then make Open Access | |
| dc.format.mimetype | application/pdf | |
| dc.identifier.other | Patel_washington_0250E_28902.pdf | |
| dc.identifier.uri | https://hdl.handle.net/1773/53972 | |
| dc.language.iso | en_US | |
| dc.rights | none | |
| dc.subject | Computer architecture | |
| dc.subject | Computer systems | |
| dc.subject | Large language models | |
| dc.subject | Computer science | |
| dc.subject | Computer engineering | |
| dc.subject.other | Computer science and engineering | |
| dc.title | Structural Insights for LLM Serving Efficiency | |
| dc.type | Thesis | |
