Structural Insights for LLM Serving Efficiency
| dc.contributor.advisor | Krishnamurthy, Arvind | |
| dc.contributor.author | Patel, Pratyush | |
| dc.date.accessioned | 2025-10-02T16:07:28Z | |
| dc.date.issued | 2025-10-02 | |
| dc.date.submitted | 2025 | |
| dc.description | Thesis (Ph.D.)--University of Washington, 2025 | |
| dc.description.abstract | The widespread adoption of Large Language Models (LLMs) has reshaped the datacenter computing landscape. As these models continue to grow in size and complexity, they require increasingly expensive and power-intensive infrastructure. Hence, serving LLMs efficiently has become critical for managing costs and resource constraints in modern datacenters. In this dissertation, I argue that serving efficiency can be significantly improved by designing systems that are aware of the distinct phases of generative LLM inference: a compute-intensive prefill phase and a memory-intensive decode phase. Exploiting the unique properties of these phases unlocks substantial performance gains at scale. My research validates this thesis through three studies. First, I address power constraints, a key bottleneck to datacenter growth. By analyzing how the distinct power demands of the prefill and decode phases aggregate, I show that inference cluster power is underutilized. Based on this observation, I develop a power oversubscription framework that safely adds more servers under existing power budgets, increasing inference cluster capacity with minimal performance impact. Second, I show that running the compute-bound prefill and memory-bound decode phases on the same hardware leads to poor performance and resource stranding. To address these overheads, I introduce a new inference cluster architecture that disaggregates the phases onto separate hardware fleets specialized for each phase's resource needs. This phase-separated cluster design yields substantial efficiency improvements over traditional approaches. Third, I extensively analyze the unique inefficiencies caused by conditional computation in Mixture-of-Experts (MoE) models, which I formalize as the MoE tax. This tax manifests differently across the two phases, for instance, creating load imbalance in prefill and increasing memory transfers in decode. Based on this analysis, I propose phase-specific optimizations to address these bottlenecks and improve the efficiency of serving MoE models at scale. Collectively, these studies demonstrate that phase awareness is a key principle for designing efficient generative LLM serving systems. | |
| dc.embargo.lift | 2027-09-22T16:07:28Z | |
| dc.embargo.terms | Restrict to UW for 2 years -- then make Open Access | |
| dc.format.mimetype | application/pdf | |
| dc.identifier.other | Patel_washington_0250E_28902.pdf | |
| dc.identifier.uri | https://hdl.handle.net/1773/53972 | |
| dc.language.iso | en_US | |
| dc.rights | none | |
| dc.subject | Computer architecture | |
| dc.subject | Computer systems | |
| dc.subject | Large language models | |
| dc.subject | Computer science | |
| dc.subject | Computer engineering | |
| dc.subject.other | Computer science and engineering | |
| dc.title | Structural Insights for LLM Serving Efficiency | |
| dc.type | Thesis | |
