38.5 - Adrià Garriga-Alonso on Detecting AI Scheming

Jan 20, 2025, 28 min

Episode description

Suppose we're worried about AIs engaging in long-term plans that they don't tell us about. If we were to peek inside their brains, what should we look for to check whether this was happening? In this episode Adrià Garriga-Alonso talks about his work trying to answer this question.

Patreon: https://www.patreon.com/axrpodcast

Ko-fi: https://ko-fi.com/axrpodcast

Transcript: https://axrp.net/episode/2025/01/20/episode-38_5-adria-garriga-alonso-detecting-ai-scheming.html

FAR.AI: https://far.ai/

FAR.AI on X (aka Twitter): https://x.com/farairesearch

FAR.AI on YouTube: https://www.youtube.com/@FARAIResearch

The Alignment Workshop: https://www.alignment-workshop.com/

 

Topics we discuss, and timestamps:

01:04 - The Alignment Workshop

02:49 - How to detect scheming AIs

05:29 - Sokoban-solving networks taking time to think

12:18 - Model organisms of long-term planning

19:44 - How and why to study planning in networks

 

Links:

Adrià's website: https://agarri.ga/

An investigation of model-free planning: https://arxiv.org/abs/1901.03559

Model-Free Planning: https://tuphs28.github.io/projects/interpplanning/

Planning in a recurrent neural network that plays Sokoban: https://arxiv.org/abs/2407.15421

 

Episode art by Hamish Doodles: hamishdoodles.com
