41 - Lee Sharkey on Attribution-based Parameter Decomposition

Jun 03, 2025 · 2 hr 16 min

Episode description

What's the next step forward in interpretability? In this episode, I chat with Lee Sharkey about his proposal for detecting computational mechanisms within neural networks: Attribution-based Parameter Decomposition, or APD for short.

Patreon: https://www.patreon.com/axrpodcast

Ko-fi: https://ko-fi.com/axrpodcast

Transcript: https://axrp.net/episode/2025/06/03/episode-41-lee-sharkey-attribution-based-parameter-decomposition.html

 

Topics we discuss, and timestamps:

0:00:41 APD basics

0:07:57 Faithfulness

0:11:10 Minimality

0:28:44 Simplicity

0:34:50 Concrete-ish examples of APD

0:52:00 Which parts of APD are canonical

0:58:10 Hyperparameter selection

1:06:40 APD in toy models of superposition

1:14:40 APD and compressed computation

1:25:43 Mechanisms vs representations

1:34:41 Future applications of APD?

1:44:19 How costly is APD?

1:49:14 More on minimality training

1:51:49 Follow-up work

2:05:24 APD on giant chain-of-thought models?

2:11:27 APD and "features"

2:14:11 Following Lee's work

 

Lee links (Leenks):

X/Twitter: https://twitter.com/leedsharkey

Alignment Forum: https://www.alignmentforum.org/users/lee_sharkey

 

Research we discuss:

Interpretability in Parameter Space: Minimizing Mechanistic Description Length with Attribution-Based Parameter Decomposition: https://arxiv.org/abs/2501.14926

Toy Models of Superposition: https://transformer-circuits.pub/2022/toy_model/index.html

Towards a unified and verified understanding of group-operation networks: https://arxiv.org/abs/2410.07476

SAE feature geometry is outside the superposition hypothesis: https://www.alignmentforum.org/posts/MFBTjb2qf3ziWmzz6/sae-feature-geometry-is-outside-the-superposition-hypothesis

 

Episode art by Hamish Doodles: hamishdoodles.com
