Compiler and Runtime Systems for Generative AI Models

dc.contributor.advisor: Ceze, Luis
dc.contributor.author: Ye, Zihao
dc.date.accessioned: 2025-10-02T16:07:17Z
dc.date.available: 2025-10-02T16:07:17Z
dc.date.issued: 2025-10-02
dc.date.submitted: 2025
dc.description: Thesis (Ph.D.)--University of Washington, 2025
dc.description.abstract: Generative AI (GenAI) workloads have rapidly become the predominant data center GPU workload. However, designing efficient GPU kernels for GenAI presents significant challenges due to two central factors: (1) GenAI workloads are intrinsically dynamic, featuring variable sequence lengths and irregular sparsity patterns, and (2) they evolve at a rapid pace, with shifting model architectures and changing deployment requirements. This dissertation addresses these challenges through a co-design approach spanning both compiler and runtime layers, presenting two complementary systems that together enable efficient GenAI acceleration.

SparseTIR is a tensor compiler designed for sparse deep learning workloads. While sparsity is pervasive in GenAI models, developing high-performance sparse GPU kernels remains difficult due to heterogeneous sparsity patterns and their distinct optimization requirements. SparseTIR introduces composable abstractions for both data formats and scheduling transformations, enabling complex optimization strategies with significantly reduced code complexity. It achieves performance competitive with hand-optimized libraries while improving modularity and developer productivity.

FlashInfer is a fast and adaptable attention engine tailored for large language model (LLM) inference. As attention increasingly dominates the computational cost of modern GenAI models, scalable and customizable GPU kernels become essential. FlashInfer supports block-sparse KV-cache layouts, Just-In-Time (JIT) compilation of parameterized attention templates, and dynamic load-balancing mechanisms compatible with CUDA Graphs. Building on this foundation, we are developing megakernels for low-latency inference and multiplexed inference scenarios. As an open-source project, FlashInfer has pioneered LLM inference kernel development and was among the first to explore techniques such as split-KV, GQA packing, and cascade inference. It has been deployed at scale in production environments and has fostered a vibrant community across academia and industry.

These systems form a cohesive framework for accelerating GenAI workloads through integrated compiler-runtime co-design. They demonstrate how principled systems approaches can achieve both high performance and adaptability in response to rapidly evolving machine learning demands, providing a foundation for future GenAI system development.
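To make the block-sparse KV-cache idea in the abstract concrete, here is a minimal NumPy sketch (not FlashInfer's actual API; all names are hypothetical): a sequence's keys and values live in fixed-size blocks scattered through a shared pool, a per-sequence block table maps logical positions to physical blocks, and single-query attention is computed over the gathered blocks.

```python
import numpy as np

# Hypothetical illustration of a block-sparse (paged) KV cache.
# Real systems avoid the explicit gather by indexing blocks inside
# the attention kernel itself.
np.random.seed(0)
head_dim, block_size, num_blocks = 8, 4, 16
kv_pool_k = np.random.randn(num_blocks, block_size, head_dim)
kv_pool_v = np.random.randn(num_blocks, block_size, head_dim)

def gather_kv(block_table, seq_len):
    """Reassemble one sequence's K/V from its scattered physical blocks."""
    k = np.concatenate([kv_pool_k[b] for b in block_table])[:seq_len]
    v = np.concatenate([kv_pool_v[b] for b in block_table])[:seq_len]
    return k, v

def attention(q, k, v):
    """Single-query attention: softmax(K q / sqrt(d))^T V."""
    scores = k @ q / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max())  # numerically stable softmax
    weights /= weights.sum()
    return weights @ v

# A 10-token logical sequence stored in physical blocks 3, 7, 1.
block_table, seq_len = [3, 7, 1], 10
q = np.random.randn(head_dim)
k, v = gather_kv(block_table, seq_len)
out = attention(q, k, v)
print(out.shape)  # (8,)
```

Because block tables decouple logical sequence layout from physical memory, new requests can reuse freed blocks without moving existing KV data, which is what makes the layout attractive for dynamic batch serving.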
dc.embargo.terms: Open Access
dc.format.mimetype: application/pdf
dc.identifier.other: Ye_washington_0250E_28830.pdf
dc.identifier.uri: https://hdl.handle.net/1773/53959
dc.language.iso: en_US
dc.rights: CC BY
dc.subject: Compilers
dc.subject: Domain-Specific Language
dc.subject: GPU
dc.subject: Computer science
dc.subject.other: Computer science and engineering
dc.title: Compiler and Runtime Systems for Generative AI Models
dc.type: Thesis

Files

Original bundle

Name: Ye_washington_0250E_28830.pdf
Size: 3.86 MB
Format: Adobe Portable Document Format