Not All Thoughts Matter: Selective Attention for Efficient Reasoning

Best AI papers explained

Nov 19, 2025•13 min

Episode description

This paper studies RWR, an inference-time optimization technique designed to reduce the high computational cost of reasoning-optimized large language models (LLMs), which generate long chains of thought. Self-attention scales quadratically with sequence length, making long reasoning chains prohibitively expensive. RWR exploits the redundancy in intermediate reasoning steps by keeping only two strategically chosen parts of the key-value (KV) cache: the first window, which holds critical problem context, and the last window, which contains the most recent reasoning steps. This simple approach achieves similar accuracy with up to a 50% reduction in KV-cache budget, which translates to substantial memory and compute savings on tasks such as math reasoning, code generation, and academic question answering, even for models trained with full quadratic attention.
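To make the first-window plus last-window idea concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the paper's implementation: the function names `retained_positions` and `prune_kv_cache`, the per-token list representation of the cache, and the window sizes are all hypothetical, and the description does not say how the paper sizes or applies the windows.

```python
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def retained_positions(seq_len: int, first_window: int, last_window: int) -> List[int]:
    """Return the token positions kept when only a first and a last window survive.

    Hypothetical helper: the window sizes are illustrative parameters,
    not values taken from the paper.
    """
    if seq_len < 0 or first_window < 0 or last_window < 0:
        raise ValueError("lengths and window sizes must be non-negative")
    if first_window + last_window >= seq_len:
        # The budget covers the whole sequence, so nothing is evicted.
        return list(range(seq_len))
    head = range(first_window)                    # problem context
    tail = range(seq_len - last_window, seq_len)  # most recent reasoning steps
    return list(head) + list(tail)


def prune_kv_cache(
    keys: Sequence[T],
    values: Sequence[T],
    first_window: int,
    last_window: int,
) -> Tuple[List[T], List[T]]:
    """Drop the middle of a per-token KV cache, keeping only the two windows.

    Assumes the cache is a pair of equal-length sequences with one entry
    per token; real caches are per-layer, per-head tensors.
    """
    if len(keys) != len(values):
        raise ValueError("keys and values must have the same length")
    keep = retained_positions(len(keys), first_window, last_window)
    return [keys[i] for i in keep], [values[i] for i in keep]


if __name__ == "__main__":
    # 10 cached tokens, keep the first 2 and the last 3: a 50% budget.
    print(retained_positions(10, 2, 3))
```

With 10 cached tokens, a first window of 2 and a last window of 3, the sketch keeps positions 0, 1, 7, 8, and 9, so the cache holds 5 entries instead of 10. Because the retained size is bounded by the two window sizes, the cost of each decoding step stops growing with the length of the reasoning chain.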
