The Utility of Interpretability — Emmanuel Amiesen

Jun 06, 2025

Episode description

Emmanuel Amiesen is lead author of “Circuit Tracing: Revealing Computational Graphs in Language Models” (https://transformer-circuits.pub/2025/attribution-graphs/methods.html), one of a pair of MechInterp papers that Anthropic published in March (alongside https://transformer-circuits.pub/2025/attribution-graphs/biology.html). We recorded the initial conversation a month ago, but held off publishing until the open source tooling for the graph generation discussed in this work was released last week: https://www.anthropic.com/research/open-source-circuit-tracing

This is a two-part episode: an intro covering the open source release, then a deeper dive into the paper, with guest host Vibhu Sapra (https://x.com/vibhuuuus) and Mochi the MechInterp Pomsky (https://x.com/mochipomsky). Thanks to Vibhu for making this episode happen!

While the original blogpost contained some fantastic guided visualizations (which we discuss at the end of this pod!), with the notebook and Neuronpedia visualization (https://www.neuronpedia.org/gemma-2-2b/graph) released this week, you can now explore on your own with Neuronpedia, as we show you in the video version of this pod.

Chapters

00:00 Intro & Guest Introductions
01:00 Anthropic's Circuit Tracing Release
06:11 Exploring Circuit Tracing Tools & Demos
13:01 Model Behaviors and User Experiments
17:02 Behind the Research: Team and Community
24:19 Main Episode Start: Mech Interp Backgrounds
25:56 Getting Into Mech Interp Research
31:52 History and Foundations of Mech Interp
37:05 Core Concepts: Superposition & Features
39:54 Applications & Interventions in Models
45:59 Challenges & Open Questions in Interpretability
57:15 Understanding Model Mechanisms: Circuits & Reasoning
01:04:24 Model Planning, Reasoning, and Attribution Graphs
01:30:52 Faithfulness, Deception, and Parallel Circuits
01:40:16 Publishing Risks, Open Research, and Visualization
01:49:33 Barriers, Vision, and Call to Action