Optimizing Chain-of-Thought Reasoners via Gradient Variance Minimization in Rejection Sampling and RL

Best AI papers explained

May 09, 2025•16 min

Episode description

This paper presents Gradient Variance Minimization (GVM), a technique for optimizing Chain-of-Thought (CoT) reasoning in Large Language Models (LLMs). The core idea is to allocate the sampling budget dynamically across prompts, based on each prompt's difficulty and gradient norm, so as to minimize the variance of the stochastic gradient estimate. Where traditional methods sample uniformly, GVM-RAFT, an adaptation of the RAFT algorithm, works in two stages: it first estimates these per-prompt characteristics, then assigns samples accordingly to reduce training noise. The authors report faster convergence and improved accuracy on mathematical reasoning tasks. They also show that the GVM strategy generalizes to other reinforcement learning (RL) algorithms such as GRPO, with similar benefits.
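
To make the two-stage idea concrete, here is a minimal Python sketch of the allocation step. It is an illustration under stated assumptions, not the paper's formula: it assumes a simplified variance model in which a prompt with acceptance rate p, gradient norm G, and n samples contributes roughly G^2 / (n * p) to the gradient variance. Minimizing the sum of these terms under a fixed total budget gives n proportional to G / sqrt(p). The function name `allocate_budget`, the `eps` floor, and the one-sample minimum per prompt are all hypothetical choices made for this example.

```python
import math


def allocate_budget(accept_rates, grad_norms, total_budget, eps=1e-3):
    """Split a sampling budget across prompts (illustrative heuristic only).

    accept_rates: stage-one estimates of how often a sampled answer is kept
                  (a proxy for prompt difficulty; lower means harder).
    grad_norms:   stage-one estimates of each prompt's gradient norm.
    Assumed rule (not taken from the paper): n_i proportional to
    G_i / sqrt(p_i), with p_i floored at eps to avoid division by zero.
    """
    if len(accept_rates) != len(grad_norms):
        raise ValueError("accept_rates and grad_norms must have equal length")
    n = len(accept_rates)
    if total_budget < n:
        raise ValueError("budget must allow at least one sample per prompt")

    weights = [g / math.sqrt(max(p, eps)) for p, g in zip(accept_rates, grad_norms)]
    if sum(weights) <= 0:
        weights = [1.0] * n  # fall back to uniform sampling
    total_weight = sum(weights)

    # Give every prompt one sample, then share the rest by weight.
    spare = total_budget - n
    raw = [spare * w / total_weight for w in weights]
    alloc = [1 + int(r) for r in raw]

    # Hand out samples lost to rounding, largest fractional part first.
    leftover = total_budget - sum(alloc)
    by_fraction = sorted(range(n), key=lambda i: raw[i] - int(raw[i]), reverse=True)
    for i in by_fraction[:leftover]:
        alloc[i] += 1
    return alloc


# Three prompts with equal gradient norms: easy, medium, hard.
allocation = allocate_budget([0.9, 0.25, 0.01], [1.0, 1.0, 1.0], total_budget=32)
print(allocation)
```

Under this assumed rule, harder prompts (lower acceptance rate) and prompts with larger gradient norms receive more samples, whereas uniform sampling would give each prompt the same share regardless.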
