33 - RLHF Problems with Scott Emmons

Jun 12, 2024 · 1 hr 41 min

Episode description

Reinforcement Learning from Human Feedback, or RLHF, is one of the main ways that makers of large language models make them 'aligned'. But people have long noted that there are difficulties with this approach when the models are smarter than the humans providing feedback. In this episode, I talk with Scott Emmons about his work categorizing the problems that can show up in this setting.

Patreon: patreon.com/axrpodcast

Ko-fi: ko-fi.com/axrpodcast

The transcript: https://axrp.net/episode/2024/06/12/episode-33-rlhf-problems-scott-emmons.html

Topics we discuss, and timestamps:

0:00:33 - Deceptive inflation

0:17:56 - Overjustification

0:32:48 - Bounded human rationality

0:50:46 - Avoiding these problems

1:14:13 - Dimensional analysis

1:23:32 - RLHF problems, in theory and practice

1:31:29 - Scott's research program

1:39:42 - Following Scott's research

Scott's website: https://www.scottemmons.com

Scott's X/Twitter account: https://x.com/emmons_scott

When Your AIs Deceive You: Challenges With Partial Observability of Human Evaluators in Reward Learning: https://arxiv.org/abs/2402.17747

Other works we discuss:

AI Deception: A Survey of Examples, Risks, and Potential Solutions: https://arxiv.org/abs/2308.14752

Uncertain decisions facilitate better preference learning: https://arxiv.org/abs/2106.10394

Invariance in Policy Optimisation and Partial Identifiability in Reward Learning: https://arxiv.org/abs/2203.07475

The Humble Gaussian Distribution (aka principal component analysis and dimensional analysis): http://www.inference.org.uk/mackay/humble.pdf

Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!: https://arxiv.org/abs/2310.03693

Episode art by Hamish Doodles: hamishdoodles.com
