TeleRAG: Optimizing Retrieval for Retrieval-Augmented Generation

Kashyap, Madhav

dc.contributor.advisor	Levow, Gina-Anne
dc.contributor.author	Kashyap, Madhav
dc.date.accessioned	2026-02-05T19:37:28Z
dc.date.available	2026-02-05T19:37:28Z
dc.date.issued	2026-02-05
dc.date.submitted	2025
dc.description	Thesis (Master's)--University of Washington, 2025
dc.description.abstract	Retrieval-augmented generation (RAG) has become essential for grounding large language models with external datastores to enhance factual correctness and domain coverage. But deployment presents a critical challenge: large language models and vector datastores compete for limited GPU memory, often forcing datastores onto the CPU and resulting in high retrieval latency. This thesis introduces TeleRAG, a system that resolves this bottleneck through lookahead retrieval, a technique that predicts and prefetches likely-needed vector search data concurrently with large language model inference. We discover that queries at different RAG pipeline stages exhibit semantic overlap, enabling effective predictive prefetching. TeleRAG combines lookahead retrieval with profile-guided prefetching optimization and GPU-CPU cooperative search. Evaluation across six RAG pipelines demonstrates 1.53× average latency reduction on consumer GPUs and 1.83× throughput improvement in batched-query scenarios. Crucially, TeleRAG remains framework- and algorithm-agnostic, enabling immediate deployment in existing production systems. By bridging CPU and GPU retrieval, TeleRAG enables efficient RAG deployment for both latency-sensitive and high-throughput applications, advancing retrieval-augmented generation across diverse environments.
dc.embargo.terms	Open Access
dc.format.mimetype	application/pdf
dc.identifier.other	Kashyap_washington_0250O_29089.pdf
dc.identifier.uri	https://hdl.handle.net/1773/55250
dc.language.iso	en_US
dc.rights	CC BY
dc.subject	artificial intelligence
dc.subject	GPU
dc.subject	large language models
dc.subject	machine learning
dc.subject	natural language processing
dc.subject	systems engineering
dc.subject	Artificial intelligence
dc.subject	Computer science
dc.subject.other	Linguistics
dc.title	TeleRAG: Optimizing Retrieval for Retrieval-Augmented Generation
dc.type	Thesis

Files

Original bundle

Name: Kashyap_washington_0250O_29089.pdf
Size: 1.46 MB
Format: Adobe Portable Document Format

Collections

Linguistics