Offline Preference Learning via Simulated Trajectory Feedback

Best AI papers explained

Apr 24, 2025 • 17 min

Episode description

This paper explores efficient ways to learn optimal decision-making policies from offline data by incorporating human preferences, addressing scenarios where direct interaction with the environment or a predefined reward function is impractical. It bridges the gap between offline reinforcement learning and preference-based reinforcement learning, with a focus on minimizing the number of human queries needed. The authors propose a novel algorithm, Sim-OPRL, which uses a learned environment model to simulate potential outcomes and elicit informative feedback on them. Theoretical analysis shows that the algorithm's sample efficiency depends on how well the offline data covers the optimal behavior, and empirical evaluations confirm that it outperforms existing offline preference learning methods.
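To make the idea of eliciting informative feedback from simulated outcomes more concrete, here is a minimal sketch. It is not the paper's Sim-OPRL implementation: the description does not give the procedure, so everything below is an assumption for illustration, including the Bradley-Terry preference model, the use of ensemble disagreement to pick a query, the toy one-dimensional environment model, and all function names.

```python
import math


def simulate_rollout(model, policy, start_state, horizon):
    """Roll a policy forward inside a learned model; returns (state, action) pairs."""
    state, trajectory = start_state, []
    for _ in range(horizon):
        action = policy(state)
        trajectory.append((state, action))
        state = model(state, action)
    return trajectory


def trajectory_return(reward_fn, trajectory):
    """Sum of a candidate reward function over a trajectory."""
    return sum(reward_fn(s, a) for s, a in trajectory)


def preference_probability(reward_fn, traj_a, traj_b):
    """Bradley-Terry probability that traj_a is preferred to traj_b (assumed model)."""
    diff = trajectory_return(reward_fn, traj_a) - trajectory_return(reward_fn, traj_b)
    return 1.0 / (1.0 + math.exp(-diff))


def most_informative_pair(reward_ensemble, candidates):
    """Return the index pair whose predicted preference varies most across the ensemble."""
    best_pair, best_score = None, -1.0
    for i in range(len(candidates)):
        for j in range(i + 1, len(candidates)):
            probs = [
                preference_probability(r, candidates[i], candidates[j])
                for r in reward_ensemble
            ]
            mean = sum(probs) / len(probs)
            score = sum((p - mean) ** 2 for p in probs) / len(probs)
            if score > best_score:
                best_pair, best_score = (i, j), score
    return best_pair, best_score


# Hypothetical stand-in for a model learned from offline data: a 1-D chain.
def learned_model(state, action):
    return state + action


# Three hypothetical candidate policies: move right, move left, stay put.
policies = [lambda s: 1, lambda s: -1, lambda s: 0]

# Two hypothetical reward hypotheses that disagree about which direction is good.
reward_ensemble = [lambda s, a: 0.5 * a, lambda s, a: -0.5 * a]

candidates = [simulate_rollout(learned_model, p, 0, 4) for p in policies]
best_pair, best_score = most_informative_pair(reward_ensemble, candidates)
print("Query the human about trajectories", best_pair)
```

In this toy setup the selected query compares the "move right" and "move left" rollouts, since that is where the two reward hypotheses disagree most. A single human answer on that pair would rule out one hypothesis, which is the sense in which a simulated query can be informative without any new interaction with the real environment.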
