TeleRAG: Optimizing Retrieval for Retrieval-Augmented Generation

dc.contributor.advisor: Levow, Gina-Anne
dc.contributor.author: Kashyap, Madhav
dc.date.accessioned: 2026-02-05T19:37:28Z
dc.date.available: 2026-02-05T19:37:28Z
dc.date.issued: 2026-02-05
dc.date.submitted: 2025
dc.description: Thesis (Master's)--University of Washington, 2025
dc.description.abstract: Retrieval-augmented generation (RAG) has become essential for grounding large language models with external datastores to enhance factual correctness and domain coverage. But deployment presents a critical challenge: large language models and vector datastores compete for limited GPU memory, often forcing datastores onto the CPU and resulting in slow CPU-based retrieval. This thesis introduces TeleRAG, a system that resolves this bottleneck through lookahead retrieval, a technique that predicts and prefetches likely-needed vector search data concurrently with large language model inference. We discover that queries at different RAG pipeline stages exhibit semantic overlap, enabling effective predictive prefetching. TeleRAG combines lookahead retrieval with profile-guided prefetching optimization and GPU-CPU cooperative search. Evaluation across six RAG pipelines demonstrates 1.53× average latency reduction on consumer GPUs and 1.83× throughput improvement in batched-query scenarios. Crucially, TeleRAG remains framework- and algorithm-agnostic, enabling immediate deployment in existing production systems. By bridging CPU and GPU retrieval, TeleRAG enables efficient RAG deployment for both latency-sensitive and high-throughput applications, advancing retrieval-augmented generation across diverse environments.
dc.embargo.terms: Open Access
dc.format.mimetype: application/pdf
dc.identifier.other: Kashyap_washington_0250O_29089.pdf
dc.identifier.uri: https://hdl.handle.net/1773/55250
dc.language.iso: en_US
dc.rights: CC BY
dc.subject: artificial intelligence
dc.subject: GPU
dc.subject: large language models
dc.subject: machine learning
dc.subject: natural language processing
dc.subject: systems engineering
dc.subject: Artificial intelligence
dc.subject: Computer science
dc.subject.other: Linguistics
dc.title: TeleRAG: Optimizing Retrieval for Retrieval-Augmented Generation
dc.type: Thesis

Files

Original bundle

Name:
Kashyap_washington_0250O_29089.pdf
Size:
1.46 MB
Format:
Adobe Portable Document Format
