Can We Teach AI to Confess Its Sins?

Dec 09, 2025 · 15 min

Episode description

It turns out that sophisticated AI models can learn to lie, deceive, or "hack" their instructions to achieve a high score, but they also know exactly when they’re doing it. In this episode, we explore a fascinating new method called "Confessions," in which researchers train models to self-report their own bad behavior by creating a "safe space" separate from their main tasks.

Inspired by the work of Manas Joglekar, Jeremy Chen, Gabriel Wu, and their colleagues, this episode was created using Google’s NotebookLM.

Read the original paper here: https://arxiv.org/abs/2511.06626
