The Path Not Taken: RLVR Provably Learns Off the Principals

Best AI papers explained

Nov 23, 2025•12 min

Episode description

This paper offers a mechanistic explanation for a paradox: **Reinforcement Learning with Verifiable Rewards (RLVR)** reliably improves large language model reasoning while making only minimal, sparse changes to parameters. The authors introduce the **Three-Gate Theory**, arguing that sparse updates are a surface artifact of a **model-conditioned optimization bias**. **Gate I (KL Anchor)** constrains each update, while **Gate II (Model Geometry)** steers the updates off the principal, high-curvature directions favored by **Supervised Fine-Tuning (SFT)** and into low-curvature subspaces, thereby preserving the model's spectral structure. **Gate III (Precision)** amplifies the appearance of sparsity: because weights are stored in bfloat16, small updates in non-preferred regions are rounded away and appear as no change at all. Consequently, the work demonstrates that **RLVR learns in a distinct optimization regime from SFT**, which suggests that SFT-era parameter-efficient fine-tuning (PEFT) techniques are often ill-suited for RL applications.
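
The precision effect behind Gate III can be illustrated with a small sketch. This is not the paper's experiment: it is a toy emulation of bfloat16 storage using only the Python standard library, and the helper names `to_bf16` and `update_survives` are hypothetical. It assumes finite, moderately sized values (no NaN, infinity, or overflow handling) and round-to-nearest-even, and shows that the same absolute update can vanish on a larger weight yet register on a smaller one.

```python
import struct


def to_bf16(x: float) -> float:
    """Round x to the nearest bfloat16 value and return it as a Python float.

    bfloat16 keeps the top 16 bits of a float32 (1 sign, 8 exponent,
    7 mantissa bits). Assumes x is finite and far from overflow.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Round to nearest, ties to even, on the 16 bits being dropped.
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    bits = (bits + rounding_bias) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def update_survives(weight: float, delta: float) -> bool:
    """Return True if adding delta changes the stored bfloat16 weight."""
    stored = to_bf16(weight)
    return to_bf16(stored + delta) != stored


if __name__ == "__main__":
    delta = 1e-3  # illustrative update size, chosen for this sketch
    for w in (1.0, 0.01):
        print(f"weight={w}: update of {delta} survives -> {update_survives(w, delta)}")
```

Near 1.0 adjacent bfloat16 values are 2^-7 apart, so an update of 1e-3 is rounded away, while near 0.01 the spacing is about 6e-5 and the same update is stored. A diff of two bfloat16 checkpoints would therefore report the first weight as unchanged even though the optimizer did attempt an update.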