Taylor, Michael
Shi, Richard
Peng, Huwan
2025-10-02
2025
Peng_washington_0250E_28662.pdf
https://hdl.handle.net/1773/53993
Thesis (Ph.D.)--University of Washington, 2025

The rapid advancements in large language models (LLMs) have significantly reshaped the artificial intelligence landscape, enabling transformative applications. However, these developments pose profound challenges for hardware architectures, particularly concerning performance, efficiency, and scalability. This dissertation investigates these critical challenges, proposing novel methodologies and architectural designs for specialized hardware, with a primary focus on optimizing large-scale LLM inference. The core contributions of this thesis are ReaLLM and Chiplet Cloud. ReaLLM is a holistic simulation framework for LLM serving, designed to bridge detailed accelerator-level insights with system-wide performance evaluations; it enables rapid exploration and precise simulation of both hardware architectures and software strategies. Chiplet Cloud is a cloud-scale architecture optimized for the Total Cost of Ownership (TCO) of LLM inference. Its key architectural innovations include fitting model parameters within on-chip memory to improve performance, co-optimizing chip size with software mapping to reduce TCO, and effectively exploiting model sparsity to support larger models. Additionally, the thesis discusses ChronoStack, a 3D memory architecture developed as part of a collaborative research effort, featuring a novel Time-Multiplexed KV-Prefetching technique optimized for the demands of long-context LLMs. The dissertation also incorporates foundational research on accelerators for earlier AI paradigms, including Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), and deep reinforcement learning, providing a broad perspective on the evolution of AI hardware.
Together, this body of work presents a detailed investigation into architectures and methodologies for AI inference hardware, tracing a clear progression from foundational network acceleration to modern large language model serving. The research aims to contribute novel approaches and critical insights toward efficient, high-performance computing for the advancing field of artificial intelligence.

application/pdf
en-US
CC BY
Electrical engineering
Electrical and computer engineering
Methodologies and Architectures for AI Inference Hardware: From Foundational Networks to Large Language Models
Thesis
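The "fitting model parameters within on-chip memory" idea behind Chiplet Cloud can be illustrated with a back-of-envelope sketch. This is not from the dissertation; the parameter sizes, SRAM capacity, and the helper `chiplets_to_fit` are hypothetical, chosen only to show how the chiplet count scales with model size when all weights must reside in on-chip SRAM:

```python
import math

def chiplets_to_fit(params_billion: float, bytes_per_param: int,
                    sram_mb_per_chiplet: float) -> int:
    """Minimum number of chiplets whose combined on-chip SRAM holds all
    model weights, so inference needs no off-chip weight traffic.
    (Illustrative helper; numbers and name are assumptions, not the thesis's.)"""
    total_bytes = params_billion * 1e9 * bytes_per_param
    sram_bytes = sram_mb_per_chiplet * 2**20
    return math.ceil(total_bytes / sram_bytes)

# e.g. a 13B-parameter model with 8-bit weights, 256 MB of SRAM per chiplet
print(chiplets_to_fit(13, 1, 256))   # -> 49
```

The sketch makes the design tension concrete: larger on-chip SRAM per chiplet reduces the chiplet count but raises per-chip cost and area, which is the kind of trade-off that chip-size/software-mapping co-optimization for TCO must navigate.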