TeleRAG: Optimizing Retrieval for Retrieval-Augmented Generation

Abstract

Retrieval-augmented generation (RAG) has become essential for grounding large language models (LLMs) in external datastores to enhance factual correctness and domain coverage. Deployment, however, presents a critical challenge: LLMs and vector datastores compete for limited GPU memory, often forcing datastores onto the CPU and incurring slow CPU-based retrieval. This thesis introduces TeleRAG, a system that resolves this bottleneck through lookahead retrieval, a technique that predicts and prefetches likely-needed vector search data concurrently with LLM inference. We discover that queries at different RAG pipeline stages exhibit semantic overlap, enabling effective predictive prefetching. TeleRAG combines lookahead retrieval with profile-guided prefetching optimization and GPU-CPU cooperative search. Evaluation across six RAG pipelines demonstrates a 1.53× average latency reduction on consumer GPUs and a 1.83× throughput improvement in batched-query scenarios. Crucially, TeleRAG is framework- and algorithm-agnostic, enabling immediate deployment in existing production systems. By bridging CPU and GPU retrieval, TeleRAG enables efficient RAG deployment for both latency-sensitive and high-throughput applications, advancing retrieval-augmented generation across diverse environments.
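To make the lookahead idea concrete, the following is a minimal, self-contained sketch of predictive prefetching over a toy IVF-style index. It is not TeleRAG's implementation; all names (`lookahead_prefetch`, `nearest_clusters`, the dict standing in for GPU memory) are illustrative assumptions. The sketch shows the core intuition from the abstract: because an intermediate pipeline query semantically overlaps the final retrieval query, the clusters it predicts can be moved to fast memory ahead of time, so the final search mostly hits prefetched data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy IVF index: vectors partitioned into clusters by nearest centroid.
DIM, N_VECS, N_CLUSTERS = 32, 1000, 16
vectors = rng.standard_normal((N_VECS, DIM)).astype(np.float32)
centroids = rng.standard_normal((N_CLUSTERS, DIM)).astype(np.float32)
assignments = np.argmin(
    ((vectors[:, None, :] - centroids[None, :, :]) ** 2).sum(-1), axis=1)
cluster_lists = {c: np.where(assignments == c)[0] for c in range(N_CLUSTERS)}

def nearest_clusters(query, nprobe):
    """Rank centroids by distance to the query; probe the closest nprobe."""
    d = ((centroids - query) ** 2).sum(-1)
    return np.argsort(d)[:nprobe]

def lookahead_prefetch(intermediate_query, nprobe):
    """Predict likely-needed clusters from an earlier pipeline query and
    copy them into fast memory (a plain dict stands in for GPU memory)."""
    predicted = nearest_clusters(intermediate_query, nprobe)
    return {int(c): vectors[cluster_lists[int(c)]] for c in predicted}

def search(final_query, prefetched, nprobe, k=5):
    """Probe clusters for the final query; count prefetch hits vs. misses
    (misses would fall back to slower CPU-side retrieval)."""
    probe = nearest_clusters(final_query, nprobe)
    hits = sum(1 for c in probe if int(c) in prefetched)
    misses = len(probe) - hits
    cand = np.concatenate([cluster_lists[int(c)] for c in probe])
    d = ((vectors[cand] - final_query) ** 2).sum(-1)
    top = cand[np.argsort(d)[:k]]
    return top, hits, misses

# The final query resembles the intermediate one (semantic overlap),
# so most probed clusters were already prefetched during LLM inference.
intermediate = rng.standard_normal(DIM).astype(np.float32)
final = intermediate + 0.1 * rng.standard_normal(DIM).astype(np.float32)
prefetched = lookahead_prefetch(intermediate, nprobe=4)
top, hits, misses = search(final, prefetched, nprobe=4)
print(hits, misses)
```

In a real system the prefetch would overlap with LLM token generation, and a profile-guided policy would choose how many clusters to prefetch; here the overlap between `intermediate` and `final` simply makes most probed clusters prefetch hits.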

Description

Thesis (Master's)--University of Washington, 2025
