On the Limits of Test-Time Compute: Sequential Reward Filtering for Better Inference - podcast episode cover

On the Limits of Test-Time Compute: Sequential Reward Filtering for Better Inference

Dec 07, 202514 min
--:--
--:--
Download Metacast podcast app
Listen to this episode in Metacast mobile app
Don't just listen to podcasts. Learn from them with transcripts, summaries, and chapters for every episode. Skim, search, and bookmark insights. Learn more

Episode description

This paper analyzes the fundalmental limitations of Best-of-N (BoN) sampling, proving theoretically that they are suboptimal under a mixture-of-reference-policies model. They propose RF-SeqBoN as a sequential approach that improves efficiency by selectively incorporating only **high-reward generations** back into the LLM's context, thereby concentrating computation on superior policy candidates. Both the theoretical analysis and extensive empirical results on diverse reasoning benchmarks confirm that RF-SeqBoN achieves a **strictly better performance-to-budget trade-off** compared to existing TTC baselines.

For the best experience, listen in Metacast app for iOS or Android