Compiler and Runtime Systems for Generative AI Models

dc.contributor.advisor: Ceze, Luis
dc.contributor.author: Ye, Zihao
dc.date.accessioned: 2025-10-02T16:07:17Z
dc.date.available: 2025-10-02T16:07:17Z
dc.date.issued: 2025-10-02
dc.date.submitted: 2025
dc.description: Thesis (Ph.D.)--University of Washington, 2025
dc.description.abstract: Generative AI (GenAI) workloads have rapidly become the predominant data center GPU workload. However, designing efficient GPU kernels for GenAI presents significant challenges due to two central factors: (1) GenAI workloads are intrinsically dynamic, featuring variable sequence lengths and irregular sparsity patterns, and (2) they evolve at a rapid pace, with shifting model architectures and changing deployment requirements. This dissertation addresses these challenges through a co-design approach spanning both compiler and runtime layers, presenting two complementary systems that together enable efficient GenAI acceleration.

SparseTIR is a tensor compiler designed for sparse deep learning workloads. While sparsity is pervasive in GenAI models, developing high-performance sparse GPU kernels remains difficult due to heterogeneous sparsity patterns and their distinct optimization requirements. SparseTIR introduces composable abstractions for both data formats and scheduling transformations, enabling complex optimization strategies with significantly reduced code complexity. It achieves performance competitive with hand-optimized libraries while improving modularity and developer productivity.

FlashInfer is a fast and adaptable attention engine tailored for large language model (LLM) inference. As attention increasingly dominates the computational cost of modern GenAI models, scalable and customizable GPU kernels become essential. FlashInfer supports block-sparse KV-cache layouts, Just-In-Time (JIT) compilation of parameterized attention templates, and dynamic load-balancing mechanisms compatible with CUDA Graphs. Building on this foundation, we are developing megakernels for low-latency inference and multiplexed inference scenarios. As an open-source project, FlashInfer has pioneered LLM inference kernel development and was among the first to explore techniques such as split-KV, GQA packing, and cascade inference. It has been deployed at scale in production environments and has fostered a vibrant community across academia and industry.

These systems form a cohesive framework for accelerating GenAI workloads through integrated compiler-runtime co-design. They demonstrate how principled systems approaches can achieve both high performance and adaptability in response to rapidly evolving machine learning demands, providing a foundation for future GenAI system development.
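To make the block-sparse KV-cache idea in the abstract concrete, here is a minimal NumPy sketch (not FlashInfer's actual API; all names are hypothetical): a sequence's keys and values live in fixed-size blocks scattered through a shared pool, a per-sequence block table maps logical positions to physical blocks, and single-query attention is computed over the gathered blocks.

```python
import numpy as np

# Hypothetical illustration of a block-sparse (paged) KV cache.
# Real systems avoid the explicit gather by indexing blocks inside
# the attention kernel itself.
np.random.seed(0)
head_dim, block_size, num_blocks = 8, 4, 16
kv_pool_k = np.random.randn(num_blocks, block_size, head_dim)
kv_pool_v = np.random.randn(num_blocks, block_size, head_dim)

def gather_kv(block_table, seq_len):
    """Reassemble one sequence's K/V from its scattered physical blocks."""
    k = np.concatenate([kv_pool_k[b] for b in block_table])[:seq_len]
    v = np.concatenate([kv_pool_v[b] for b in block_table])[:seq_len]
    return k, v

def attention(q, k, v):
    """Single-query attention: softmax(K q / sqrt(d))^T V."""
    scores = k @ q / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max())  # numerically stable softmax
    weights /= weights.sum()
    return weights @ v

# A 10-token logical sequence stored in physical blocks 3, 7, 1.
block_table, seq_len = [3, 7, 1], 10
q = np.random.randn(head_dim)
k, v = gather_kv(block_table, seq_len)
out = attention(q, k, v)
print(out.shape)  # (8,)
```

Because block tables decouple logical sequence layout from physical memory, new requests can reuse freed blocks without moving existing KV data, which is what makes the layout attractive for dynamic batch serving.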
dc.embargo.terms: Open Access
dc.format.mimetype: application/pdf
dc.identifier.other: Ye_washington_0250E_28830.pdf
dc.identifier.uri: https://hdl.handle.net/1773/53959
dc.language.iso: en_US
dc.rights: CC BY
dc.subject: Compilers
dc.subject: Domain-Specific Language
dc.subject: GPU
dc.subject: Computer science
dc.subject.other: Computer science and engineering
dc.title: Compiler and Runtime Systems for Generative AI Models
dc.type: Thesis

Files

Original bundle

Name: Ye_washington_0250E_28830.pdf
Size: 3.86 MB
Format: Adobe Portable Document Format