RL's Razor: Why Online RL Forgets Less

Sep 07, 2025 · 25 min

Episode description

This paper explores why **Reinforcement Learning (RL) fine-tuning causes less catastrophic forgetting** than **Supervised Fine-Tuning (SFT)**, even when both reach similar performance on the new task. The authors introduce **"RL's Razor,"** the principle that, when learning a new task, **RL is implicitly biased toward solutions that change the original model's policy as little as possible, as measured by KL divergence**. Empirical and theoretical evidence supports this: **the KL divergence from the original policy, measured on the new task, is a strong predictor of forgetting**, regardless of the training algorithm. The core reason for RL's advantage is its **on-policy training**: it samples from the model's current distribution and reweights those samples, which yields more conservative, lower-KL updates than SFT, which trains on fixed external annotations.
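
To make the on-policy intuition concrete, here is a minimal toy sketch, not the paper's experiment. It assumes a single prompt with four candidate answers, a hypothetical base policy `base`, and a hypothetical binary reward under which two answers are correct. The "RL-style" target is an idealization: the base policy reweighted by reward and renormalized. The "SFT-style" target is a fixed annotation that picks one correct answer. The direction of the KL divergence, KL(fine-tuned || base), is also an assumption made for illustration.

```python
import math

def kl(p, q):
    """KL(p || q) for two discrete distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical base policy over four candidate answers to one prompt.
base = [0.50, 0.30, 0.15, 0.05]

# Hypothetical binary reward: answers 1 and 3 are both correct.
reward = [0, 1, 0, 1]

# SFT-style target: a fixed external annotation that happens to pick answer 3.
sft = [0.0, 0.0, 0.0, 1.0]

# On-policy-style target (idealized): reweight the base policy's own
# samples by reward, then renormalize.
weighted = [b * r for b, r in zip(base, reward)]
total = sum(weighted)
rl = [w / total for w in weighted]

# Both targets put all probability on correct answers, but they sit at
# different distances from the base policy.
kl_sft = kl(sft, base)
kl_rl = kl(rl, base)

print(f"KL(sft || base) = {kl_sft:.3f}")
print(f"KL(rl  || base) = {kl_rl:.3f}")
```

In this toy setup both targets solve the task perfectly, yet the reweighted policy stays closer to the base policy because it keeps the base model's relative preferences among the correct answers instead of collapsing onto an externally chosen one.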
