Bootstrapping Language Models with DPO Implicit Rewards

Best AI papers explained

May 02, 2025 • 20 min



Episode description

This paper introduces DICE, a method for improving the alignment of large language models (LLMs) with human preferences by bootstrapping from the implicit reward model that Direct Preference Optimization (DPO) induces. Unlike approaches that rely on external feedback or an explicitly trained reward model, DICE uses the reward signal inherent in a DPO-tuned model to score the model's own responses and build new preference data. To improve the quality of this self-generated data and avoid issues such as a bias toward overly long responses, the method adds length-regularized reward shaping and experience replay of the initial human preference data. Empirical results show that this iterative self-alignment process improves the model's performance on standard benchmarks.
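The three ingredients named in the description (implicit reward, length-regularized shaping, experience replay) can be illustrated with a small sketch. This is not the paper's implementation. It assumes the standard DPO implicit reward, beta * (log pi(y|x) - log pi_ref(y|x)), and assumes the length regularization is a linear per-token penalty. All function names, the `alpha` and `beta` values, and the `replay_fraction` parameter are hypothetical and chosen only for illustration.

```python
import random
from typing import List, Tuple

# A candidate response: (text, log-prob under policy, log-prob under reference, token count)
Candidate = Tuple[str, float, float, int]
# A preference pair: (prompt, chosen response, rejected response)
Pair = Tuple[str, str, str]


def implicit_reward(logp_policy: float, logp_ref: float, beta: float) -> float:
    """Standard DPO implicit reward (up to a prompt-only constant)."""
    return beta * (logp_policy - logp_ref)


def length_regularized_reward(
    logp_policy: float, logp_ref: float, num_tokens: int, beta: float, alpha: float
) -> float:
    """Implicit reward minus a linear length penalty (assumed form)."""
    return implicit_reward(logp_policy, logp_ref, beta) - alpha * num_tokens


def build_preference_pair(
    prompt: str, candidates: List[Candidate], beta: float, alpha: float
) -> Pair:
    """Rank sampled responses by shaped reward; best is chosen, worst is rejected."""
    ranked = sorted(
        candidates,
        key=lambda c: length_regularized_reward(c[1], c[2], c[3], beta, alpha),
    )
    return (prompt, ranked[-1][0], ranked[0][0])


def mix_with_replay(
    generated: List[Pair], human: List[Pair], replay_fraction: float, rng: random.Random
) -> List[Pair]:
    """Mix self-generated pairs with replayed human pairs.

    replay_fraction is the target share of human pairs in the mixed set
    (hypothetical parameterization, must be in [0, 1)).
    """
    if not 0.0 <= replay_fraction < 1.0:
        raise ValueError("replay_fraction must be in [0, 1)")
    n_replay = round(len(generated) * replay_fraction / (1.0 - replay_fraction))
    n_replay = min(n_replay, len(human))
    mixed = generated + rng.sample(human, n_replay)
    rng.shuffle(mixed)
    return mixed


# Illustrative numbers only: the long response has the higher raw implicit reward,
# but the length penalty flips the ranking in favor of the short one.
candidates = [
    ("short answer", -10.0, -12.0, 5),
    ("long answer", -40.0, -43.0, 50),
]
pair_raw = build_preference_pair("prompt", candidates, beta=0.1, alpha=0.0)
pair_shaped = build_preference_pair("prompt", candidates, beta=0.1, alpha=0.01)
```

With the penalty disabled (`alpha=0.0`) the longer response is selected as chosen, which is the length bias the description mentions. With `alpha=0.01` the shorter response wins. In an iterative setup, the output of `mix_with_replay` would feed the next round of DPO training.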
