Not All Thoughts Matter: Selective Attention for Efficient Reasoning

Nov 19, 2025 · 13 min

Episode description

This paper studies RWR, an inference-time optimization technique designed to reduce the high computational cost of reasoning-optimized large language models (LLMs), which generate long chains of thought. Self-attention in LLMs typically scales quadratically with sequence length, making long reasoning chains prohibitively expensive. RWR exploits the redundancy in intermediate reasoning steps by keeping only two strategically chosen parts of the key-value (KV) cache: the first window, which holds critical problem context, and the last window, which contains the most recent reasoning steps. This simple approach achieves similar accuracy with up to a 50% reduction in KV-cache budget, cutting memory and compute requirements across tasks such as math reasoning, code generation, and academic question answering, even for models trained with full quadratic attention.
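To make the first-window plus last-window idea concrete, here is a minimal Python sketch of a cache that retains only those two regions. It is an illustration under assumptions, not the paper's implementation: the class name `FirstLastKVCache`, the parameters `first_window` and `last_window`, and the choice to evict per token are all hypothetical, and the description above does not say how the windows are sized or whether eviction happens per layer or per head.

```python
from collections import deque
from typing import Any, Deque, List, Tuple

# One cached entry: (token position, opaque key/value payload).
Entry = Tuple[int, Any]


class FirstLastKVCache:
    """Hypothetical cache keeping the first and the most recent KV entries.

    Everything between the two windows is dropped, so the cache size is
    bounded by first_window + last_window regardless of sequence length.
    """

    def __init__(self, first_window: int, last_window: int) -> None:
        if first_window < 0 or last_window <= 0:
            raise ValueError("first_window must be >= 0 and last_window > 0")
        self.first_window = first_window
        self.first: List[Entry] = []                       # problem context, never evicted
        self.last: Deque[Entry] = deque(maxlen=last_window)  # sliding recent window
        self.seen = 0                                      # total tokens appended so far

    def append(self, kv: Any) -> None:
        """Add the KV payload for the next token position."""
        entry = (self.seen, kv)
        if len(self.first) < self.first_window:
            self.first.append(entry)
        else:
            # deque(maxlen=...) silently discards the oldest entry when full.
            self.last.append(entry)
        self.seen += 1

    def positions(self) -> List[int]:
        """Token positions currently retained, in ascending order."""
        return [p for p, _ in self.first] + [p for p, _ in self.last]

    def __len__(self) -> int:
        return len(self.first) + len(self.last)


if __name__ == "__main__":
    cache = FirstLastKVCache(first_window=4, last_window=4)
    for t in range(20):
        cache.append(f"kv_{t}")  # placeholder payload; real entries would be tensors
    print(cache.positions())
```

With these window sizes the cache never holds more than 8 entries however long the reasoning chain grows; attention over the retained entries then costs a fixed amount per generated token instead of growing with the full sequence length.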