Getting More Juice Out of the SFT Data: Reward Learning from Human Demonstration Improves SFT

Best AI papers explained

May 02, 2025•13 min

Episode description

This paper proposes a way to fine-tune Large Language Models (LLMs) on demonstration data, which provides only examples of desired outputs. Standard supervised fine-tuning (SFT) directly mimics those demonstrations; the authors argue instead that learning a reward from the same data can significantly improve LLM alignment with human preferences. They introduce two algorithms, Reward-learning Fine-tune (RFT) and Implicit Reward-learning Fine-tune (IRFT), both built on an Inverse Reinforcement Learning (IRL) framework that jointly learns a reward model and the language model policy. Through theoretical analysis and experiments across different LLM sizes and datasets, they show that these reward-based methods consistently outperform traditional SFT on several evaluation metrics, including the HuggingFace Open LLM Leaderboard. The paper also connects its implicit reward learning approach to the recent Self-Play Fine-tune (SPIN) algorithm, which gives this style of training a theoretical grounding.
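To make the "implicit reward" idea more concrete, here is a minimal sketch of a SPIN-style pairwise loss in plain Python. It is an illustration under stated assumptions, not the paper's RFT or IRFT objective: the reward is taken to be the scaled log-ratio between the current policy and a frozen reference model, and the loss pushes the reward of a human demonstration above the reward of a model-generated response. The function names, the `beta` scale, and the scalar log-probability inputs are all hypothetical choices made for this example.

```python
import math


def implicit_reward(logp_policy: float, logp_ref: float, beta: float = 1.0) -> float:
    """Assumed implicit reward: beta * (log pi_policy(y|x) - log pi_ref(y|x))."""
    return beta * (logp_policy - logp_ref)


def spin_style_loss(
    logp_demo: float,      # policy log-prob of the human demonstration
    logp_demo_ref: float,  # reference log-prob of the human demonstration
    logp_gen: float,       # policy log-prob of a model-generated response
    logp_gen_ref: float,   # reference log-prob of that generated response
    beta: float = 1.0,
) -> float:
    """Logistic loss on the reward margin between demonstration and generation."""
    margin = implicit_reward(logp_demo, logp_demo_ref, beta) - implicit_reward(
        logp_gen, logp_gen_ref, beta
    )
    # log(1 + exp(-margin)), written in a numerically stable form
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
```

When the policy equals the reference, the margin is zero and the loss is log 2; the loss shrinks as the policy raises the likelihood of demonstrations relative to its own generations. No explicit reward network appears in this sketch, because the reward is read off the policy itself.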
