Minimalist LLM Reasoning: Rejection Sampling to Reinforcement

Best AI papers explained

Apr 19, 2025 • 13 min

Episode description

This paper investigates reinforcement learning methods for fine-tuning large language models on complex reasoning tasks, particularly mathematical problems. The authors analyze GRPO, a successful but poorly understood algorithm, and find that RAFT, a simpler rejection sampling method that trains only on positively rewarded samples, achieves comparable results. Their analysis indicates that GRPO's effectiveness stems mainly from discarding prompts whose responses are all incorrect. This leads them to propose Reinforce-Rej, a refined algorithm that also filters out prompts whose responses are all correct, for improved efficiency and stability. The study advocates RAFT as a robust baseline and suggests that future work prioritize principled integration of negative samples over their indiscriminate use.
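
The two filtering rules in the description can be illustrated with a minimal Python sketch. It is not the authors' implementation: the data layout (a dict mapping each prompt to a list of response and reward pairs), the binary 0/1 reward, and the function names `raft_filter` and `reinforce_rej_filter` are all assumptions made for illustration, and the training step that would consume the filtered samples is omitted.

```python
from typing import Dict, List, Tuple

# Assumed layout: prompt -> list of (response, reward); reward is 1 (correct) or 0 (incorrect).
Samples = Dict[str, List[Tuple[str, int]]]


def raft_filter(samples: Samples) -> Samples:
    """RAFT-style rejection sampling: keep only positively rewarded responses.

    Prompts with no correct response contribute nothing and are dropped.
    """
    kept: Samples = {}
    for prompt, responses in samples.items():
        positives = [(resp, rew) for resp, rew in responses if rew > 0]
        if positives:
            kept[prompt] = positives
    return kept


def reinforce_rej_filter(samples: Samples) -> Samples:
    """Reinforce-Rej-style prompt filter (hypothetical sketch).

    Drop prompts whose responses are all incorrect or all correct;
    keep mixed prompts with both their positive and negative responses.
    """
    kept: Samples = {}
    for prompt, responses in samples.items():
        n_correct = sum(rew for _, rew in responses)
        if 0 < n_correct < len(responses):
            kept[prompt] = list(responses)
    return kept


# Hypothetical toy data: one mixed prompt, one all-incorrect, one all-correct.
example: Samples = {
    "p_mixed": [("a", 1), ("b", 0)],
    "p_all_wrong": [("c", 0), ("d", 0)],
    "p_all_right": [("e", 1), ("f", 1)],
}

raft_kept = raft_filter(example)
rej_kept = reinforce_rej_filter(example)
```

On the toy data, the RAFT-style filter keeps the correct responses from the mixed and all-correct prompts, while the Reinforce-Rej-style filter keeps only the mixed prompt, including its incorrect response.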
