REFRAG: Rethinking RAG based Decoding

Best AI papers explained

Dec 13, 2025•14 min

Episode description

This paper introduces REFRAG, an efficient decoding framework designed to accelerate Retrieval-Augmented Generation (RAG) in Large Language Models (LLMs) by addressing the high latency and memory demands of long-context inputs. The core mechanism compresses context by representing each chunk of retrieved text as a single embedding, which significantly shortens the input sequence to the decoder and exploits the **sparse attention patterns** inherent in RAG contexts. Using **selective compression** managed by a lightweight reinforcement learning (RL) policy, REFRAG achieves up to **30.85x faster Time-to-First-Token (TTFT)** without sacrificing accuracy, and enables LLMs to handle context windows up to **16x larger**. Experimental results show that this specialized approach outperforms existing methods such as CEPE across RAG, multi-turn conversation, and summarization tasks, highlighting the trade-off between knowledge enrichment and system efficiency.
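
To make the chunk-compression idea concrete, here is a minimal Python sketch, not the paper's implementation. It assumes that mean pooling stands in for the learned chunk encoder and that a hand-picked list of chunk indices (`keep_raw`) stands in for the RL policy's selective-compression decisions. The names `mean_pool`, `compress_context`, and `keep_raw` are hypothetical.

```python
from typing import List, Sequence

Vector = List[float]


def mean_pool(vectors: Sequence[Vector]) -> Vector:
    """Average token vectors into one chunk vector.

    Assumption: a simple stand-in for a learned chunk encoder.
    """
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def compress_context(
    token_embs: Sequence[Vector],
    chunk_size: int,
    keep_raw: Sequence[int] = (),
) -> List[Vector]:
    """Replace each chunk of `chunk_size` token embeddings with one vector.

    Chunks whose index appears in `keep_raw` are passed through uncompressed.
    Assumption: `keep_raw` is a hand-written stand-in for a policy's choices.
    """
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    keep = set(keep_raw)
    decoder_input: List[Vector] = []
    for idx, start in enumerate(range(0, len(token_embs), chunk_size)):
        chunk = token_embs[start:start + chunk_size]
        if idx in keep:
            # Selected chunk: keep every token position.
            decoder_input.extend(list(v) for v in chunk)
        else:
            # Compressed chunk: one position for the whole chunk.
            decoder_input.append(mean_pool(chunk))
    return decoder_input


# Toy context: 64 token embeddings of dimension 2.
context = [[float(i), float(i) * 2.0] for i in range(64)]

compressed = compress_context(context, chunk_size=16)
selective = compress_context(context, chunk_size=16, keep_raw=[1])

print(len(context), len(compressed), len(selective))  # prints: 64 4 19
```

In this toy setup, full compression shrinks 64 decoder positions to 4, while keeping chunk 1 uncompressed yields 3 + 16 = 19 positions. The shorter sequence is what reduces the decoder's work before the first token is produced.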
