TeleRAG: Optimizing Retrieval for Retrieval-Augmented Generation

Abstract

Retrieval-augmented generation (RAG) has become essential for grounding large language models (LLMs) in external datastores to enhance factual correctness and domain coverage. Deployment, however, presents a critical challenge: LLMs and vector datastores compete for limited GPU memory, often forcing datastores onto the CPU and incurring slow CPU-based retrieval. This thesis introduces TeleRAG, a system that resolves this bottleneck through lookahead retrieval, a technique that predicts and prefetches likely-needed vector search data concurrently with LLM inference. We discover that queries at different RAG pipeline stages exhibit semantic overlap, enabling effective predictive prefetching. TeleRAG combines lookahead retrieval with profile-guided prefetching optimization and GPU-CPU cooperative search. Evaluation across six RAG pipelines demonstrates a 1.53× average latency reduction on consumer GPUs and a 1.83× throughput improvement in batched-query scenarios. Crucially, TeleRAG is framework- and algorithm-agnostic, enabling immediate deployment in existing production systems. By bridging CPU and GPU retrieval, TeleRAG enables efficient RAG deployment for both latency-sensitive and high-throughput applications, advancing retrieval-augmented generation across diverse environments.
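To make the lookahead idea concrete, the following is a minimal, self-contained sketch of predictive prefetching over a toy IVF-style index. It is not TeleRAG's implementation; all names (`lookahead_prefetch`, `nearest_clusters`, the dict standing in for GPU memory) are illustrative assumptions. The sketch shows the core intuition from the abstract: because an intermediate pipeline query semantically overlaps the final retrieval query, the clusters it predicts can be moved to fast memory ahead of time, so the final search mostly hits prefetched data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy IVF index: vectors partitioned into clusters by nearest centroid.
DIM, N_VECS, N_CLUSTERS = 32, 1000, 16
vectors = rng.standard_normal((N_VECS, DIM)).astype(np.float32)
centroids = rng.standard_normal((N_CLUSTERS, DIM)).astype(np.float32)
assignments = np.argmin(
    ((vectors[:, None, :] - centroids[None, :, :]) ** 2).sum(-1), axis=1)
cluster_lists = {c: np.where(assignments == c)[0] for c in range(N_CLUSTERS)}

def nearest_clusters(query, nprobe):
    """Rank centroids by distance to the query; probe the closest nprobe."""
    d = ((centroids - query) ** 2).sum(-1)
    return np.argsort(d)[:nprobe]

def lookahead_prefetch(intermediate_query, nprobe):
    """Predict likely-needed clusters from an earlier pipeline query and
    copy them into fast memory (a plain dict stands in for GPU memory)."""
    predicted = nearest_clusters(intermediate_query, nprobe)
    return {int(c): vectors[cluster_lists[int(c)]] for c in predicted}

def search(final_query, prefetched, nprobe, k=5):
    """Probe clusters for the final query; count prefetch hits vs. misses
    (misses would fall back to slower CPU-side retrieval)."""
    probe = nearest_clusters(final_query, nprobe)
    hits = sum(1 for c in probe if int(c) in prefetched)
    misses = len(probe) - hits
    cand = np.concatenate([cluster_lists[int(c)] for c in probe])
    d = ((vectors[cand] - final_query) ** 2).sum(-1)
    top = cand[np.argsort(d)[:k]]
    return top, hits, misses

# The final query resembles the intermediate one (semantic overlap),
# so most probed clusters were already prefetched during LLM inference.
intermediate = rng.standard_normal(DIM).astype(np.float32)
final = intermediate + 0.1 * rng.standard_normal(DIM).astype(np.float32)
prefetched = lookahead_prefetch(intermediate, nprobe=4)
top, hits, misses = search(final, prefetched, nprobe=4)
print(hits, misses)
```

In a real system the prefetch would overlap with LLM token generation, and a profile-guided policy would choose how many clusters to prefetch; here the overlap between `intermediate` and `final` simply makes most probed clusters prefetch hits.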

Description

Thesis (Master's)--University of Washington, 2025
