"Sparse Autoencoders Find Highly Interpretable Directions in Language Models" by Logan Riggs et al - podcast episode cover

"Sparse Autoencoders Find Highly Interpretable Directions in Language Models" by Logan Riggs et al

Sep 27, 202310 min
--:--
--:--
Download Metacast podcast app
Listen to this episode in Metacast mobile app
Don't just listen to podcasts. Learn from them with transcripts, summaries, and chapters for every episode. Skim, search, and bookmark insights. Learn more

Episode description

This is a linkpost for Sparse Autoencoders Find Highly Interpretable Directions in Language Models

We use a scalable and unsupervised method called Sparse Autoencoders to find interpretable, monosemantic features in real LLMs (Pythia-70M/410M) for both residual stream and MLPs. We showcase monosemantic features, feature replacement for Indirect Object Identification (IOI), and use OpenAI's automatic interpretation protocol to demonstrate a significant improvement in interpretability.

Source:
https://www.lesswrong.com/posts/Qryk6FqjtZk9FHHJR/sparse-autoencoders-find-highly-interpretable-directions-in

Narrated for LessWrong by TYPE III AUDIO.

Share feedback on this narration.

[125+ Karma Post]

For the best experience, listen in Metacast app for iOS or Android