RL's Razor: Why Online RL Forgets Less

Best AI papers explained

Sep 07, 2025 • 25 min


Episode description

This paper explores why **Reinforcement Learning (RL) fine-tuning causes less catastrophic forgetting** than **Supervised Fine-Tuning (SFT)**, even when both reach similar performance on the new task. The authors introduce **"RL's Razor,"** the principle that, when learning a new task, **RL is implicitly biased toward solutions that change the original model's policy as little as possible, as measured by KL divergence**. Empirical and theoretical evidence supports this and shows that **KL divergence from the original model, measured on the new task, is a strong predictor of forgetting**, regardless of the training algorithm. The core reason for RL's advantage is its **on-policy training**: it samples from the model's current distribution and reweights those samples, which yields more conservative, lower-KL updates than SFT, which relies on fixed external annotations.
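As a rough illustration of the idea (a toy sketch, not the paper's experimental setup), consider a single prompt with four possible answers, two of which are correct. The probabilities, the reward vector, the names `rl_like` and `sft_like`, and the choice to measure KL from the new policy to the base policy are all assumptions made for this example. The "RL-like" update keeps only the base policy's own rewarded samples and renormalizes them, while the "SFT-like" update copies a single externally annotated answer.

```python
import math

def kl(p, q):
    """KL(p || q) for discrete distributions; terms with p_i == 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def expected_reward(policy, reward):
    """Average reward when sampling an answer from `policy`."""
    return sum(p * r for p, r in zip(policy, reward))

# Hypothetical base policy over four candidate answers to one prompt.
base = [0.5, 0.3, 0.15, 0.05]
# Hypothetical binary reward: answers 1 and 3 both solve the new task.
reward = [0, 1, 0, 1]

# "RL-like" update: reweight the model's own samples by reward and renormalize.
# This keeps the base model's relative preferences among correct answers.
weights = [b * r for b, r in zip(base, reward)]
total = sum(weights)
rl_like = [w / total for w in weights]

# "SFT-like" update: imitate one fixed external annotation (answer 3),
# which happens to be an answer the base model rarely produces.
sft_like = [0.0, 0.0, 0.0, 1.0]

kl_rl = kl(rl_like, base)
kl_sft = kl(sft_like, base)

print("reward (RL-like, SFT-like):",
      expected_reward(rl_like, reward), expected_reward(sft_like, reward))
print("KL from base (RL-like, SFT-like):", round(kl_rl, 3), round(kl_sft, 3))
```

In this toy case both updated policies reach full reward on the new task, but the reweighted policy stays closer to the base policy in KL than the one that imitates the external label. That is the pattern the episode description attributes to on-policy training; how well it carries over to real models is the empirical question the paper addresses.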
