What’s In My Human Feedback? Learning Interpretable Descriptions of Preference Data

Best AI papers explained

Dec 19, 2025•16 min

Episode description

This paper introduces What’s In My Human Feedback? (WIMHF), a method for automatically uncovering the preferences encoded in the human feedback data used to train language models. Using sparse autoencoders, WIMHF translates text embeddings into a small set of interpretable features that explain why annotators prefer one response over another. The research shows that feedback datasets often contain conflicting signals: Reddit users favor informal jokes, for example, while other groups disfavor them. The authors also demonstrate that WIMHF can identify misaligned or unsafe preferences, such as a bias against model refusals in certain benchmarks. The discovered features let developers curate safer datasets by flipping harmful labels and personalize model behavior based on individual users' stylistic preferences. Overall, the work offers a human-centered diagnostic tool that makes the black-box process of model alignment more transparent and controllable.
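
To make the description more concrete, here is a rough sketch of the general idea: a sparse encoding of embedding differences, followed by a per-feature comparison against preference labels. It is not the authors' implementation. The episode does not say how the autoencoder is built or trained, so the top-k sparsity rule, the use of the difference between the two responses' embeddings, the random untrained weights, the synthetic data, and every name below (`relu_topk`, `encode`, `preference_gap`) are assumptions made for illustration only.

```python
import random


def relu_topk(pre_acts, k):
    """Apply ReLU, then keep only the k largest activations (others set to 0)."""
    acts = [max(v, 0.0) for v in pre_acts]
    keep = sorted(range(len(acts)), key=lambda i: acts[i], reverse=True)[:k]
    return [acts[i] if i in keep else 0.0 for i in range(len(acts))]


def encode(diff, weights, k):
    """Project an embedding difference onto each feature direction, then sparsify."""
    pre = [sum(w * x for w, x in zip(row, diff)) for row in weights]
    return relu_topk(pre, k)


def preference_gap(codes, labels):
    """Per feature: mean activation where response A won minus mean where B won."""
    gaps = []
    for j in range(len(codes[0])):
        a_won = [c[j] for c, y in zip(codes, labels) if y == 1]
        b_won = [c[j] for c, y in zip(codes, labels) if y == 0]
        gaps.append(sum(a_won) / len(a_won) - sum(b_won) / len(b_won))
    return gaps


# Hypothetical sizes and synthetic data; a real pipeline would use embeddings of
# actual response pairs and a trained autoencoder rather than random weights.
random.seed(0)
D_EMBED, N_FEATURES, K, N_PAIRS = 8, 16, 3, 100
weights = [[random.gauss(0, 1) for _ in range(D_EMBED)] for _ in range(N_FEATURES)]
diffs = [[random.gauss(0, 1) for _ in range(D_EMBED)] for _ in range(N_PAIRS)]
labels = [i % 2 for i in range(N_PAIRS)]  # placeholder labels: 1 means A preferred

codes = [encode(d, weights, K) for d in diffs]
gaps = preference_gap(codes, labels)

# Features whose activation differs most between the two outcomes.
top = sorted(range(N_FEATURES), key=lambda j: abs(gaps[j]), reverse=True)[:3]
print(top)
```

With trained weights and real annotations, features with a large gap would be the candidates for a human-readable description such as "contains a joke" or "refuses the request". How such descriptions are produced is not covered in the episode description, so it is left out here.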
