43 - David Lindner on Myopic Optimization with Non-myopic Approval - podcast episode cover

43 - David Lindner on Myopic Optimization with Non-myopic Approval

Jun 15, 20251 hr 41 min
--:--
--:--
Download Metacast podcast app
Listen to this episode in Metacast mobile app
Don't just listen to podcasts. Learn from them with transcripts, summaries, and chapters for every episode. Skim, search, and bookmark insights. Learn more

Episode description

In this episode, I talk with David Lindner about Myopic Optimization with Non-myopic Approval, or MONA, which attempts to address (multi-step) reward hacking by myopically optimizing actions against a human's sense of whether those actions are generally good. Does this work? Can we get smarter-than-human AI this way? How does this compare to approaches like conservativism? Listen to find out.

Patreon: https://www.patreon.com/axrpodcast

Ko-fi: https://ko-fi.com/axrpodcast

Transcript: https://axrp.net/episode/2025/06/15/episode-43-david-lindner-mona.html

 

Topics we discuss, and timestamps:

0:00:29 What MONA is

0:06:33 How MONA deals with reward hacking

0:23:15 Failure cases for MONA

0:36:25 MONA's capability

0:55:40 MONA vs other approaches

1:05:03 Follow-up work

1:10:17 Other MONA test cases

1:33:47 When increasing time horizon doesn't increase capability

1:39:04 Following David's research

 

Links for David:

Website: https://www.davidlindner.me

Twitter / X: https://x.com/davlindner

DeepMind Medium: https://deepmindsafetyresearch.medium.com

David on the Alignment Forum: https://www.alignmentforum.org/users/david-lindner

 

Research we discuss:

MONA: Myopic Optimization with Non-myopic Approval Can Mitigate Multi-step Reward Hacking: https://arxiv.org/abs/2501.13011

Arguments Against Myopic Training: https://www.alignmentforum.org/posts/GqxuDtZvfgL2bEQ5v/arguments-against-myopic-training

 

Episode art by Hamish Doodles: hamishdoodles.com

For the best experience, listen in Metacast app for iOS or Android