On the Limits of Test-Time Compute: Sequential Reward Filtering for Better Inference
Dec 07, 2025•14 min
Episode description
This paper analyzes the fundamental limitations of Best-of-N (BoN) sampling, proving theoretically that it is suboptimal under a mixture-of-reference-policies model. The authors propose RF-SeqBoN, a sequential approach that improves efficiency by feeding only **high-reward generations** back into the LLM's context, thereby concentrating computation on superior policy candidates. Both the theoretical analysis and extensive empirical results on diverse reasoning benchmarks confirm that RF-SeqBoN achieves a **strictly better performance-to-budget trade-off** than existing test-time compute (TTC) baselines.
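To make the contrast concrete, here is a minimal Python sketch of plain BoN next to a reward-filtered sequential loop in the spirit of the description above. It is an illustration under stated assumptions, not the paper's algorithm: the fixed `threshold` rule, the function names, and the toy `toy_generate` and `toy_reward` stand-ins (a number-guessing game instead of an LLM and a reward model) are all hypothetical.

```python
import random
from typing import Callable, List, Tuple

# Hypothetical interfaces: a generator takes (prompt, accepted context, rng),
# and a reward function scores a single generation.
Generator = Callable[[str, List[str], random.Random], str]
Reward = Callable[[str], float]


def best_of_n(prompt: str, generate: Generator, reward: Reward,
              n: int, rng: random.Random) -> str:
    """Plain BoN: draw n independent samples and keep the highest-reward one."""
    samples = [generate(prompt, [], rng) for _ in range(n)]
    return max(samples, key=reward)


def reward_filtered_sequential_bon(prompt: str, generate: Generator,
                                   reward: Reward, n: int, threshold: float,
                                   rng: random.Random) -> Tuple[str, List[str]]:
    """Sequential sketch: only samples whose reward reaches `threshold`
    are appended to the context seen by later samples (assumed filter rule)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    context: List[str] = []
    best, best_reward = "", float("-inf")
    for _ in range(n):
        sample = generate(prompt, context, rng)
        r = reward(sample)
        if r > best_reward:
            best, best_reward = sample, r
        if r >= threshold:
            context.append(sample)  # low-reward samples are discarded
    return best, context


def toy_generate(prompt: str, context: List[str], rng: random.Random) -> str:
    """Stand-in for an LLM call: guess an integer in [0, 100]. With accepted
    context, guess near the most recently accepted value."""
    if context:
        center = int(context[-1])
        return str(min(100, max(0, center + rng.randint(-5, 5))))
    return str(rng.randint(0, 100))


def toy_reward(sample: str) -> float:
    """Stand-in reward: closer to 42 is better; the maximum is 0."""
    return -abs(int(sample) - 42)


answer, kept = reward_filtered_sequential_bon(
    "guess the number", toy_generate, toy_reward,
    n=16, threshold=-10.0, rng=random.Random(0),
)
print(answer, kept)
```

In this toy setting, once one guess clears the threshold, later guesses cluster around it, which is the "concentrating computation" idea in miniature. Setting the threshold above any achievable reward keeps the context empty, so the loop reduces to plain BoN.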
