Can sparse autoencoders be used to decompose and interpret steering vectors?

Daily Paper Cast

Nov 15, 2024•22 min•Ep. 79

Episode description

🤗 Paper Upvotes: 6 | cs.LG, cs.AI, cs.CL

Authors:
Harry Mayne, Yushi Yang, Adam Mahdi

Title:
Can sparse autoencoders be used to decompose and interpret steering vectors?

arXiv:
http://arxiv.org/abs/2411.08790v1

Abstract:
Steering vectors are a promising approach to control the behaviour of large language models. However, their underlying mechanisms remain poorly understood. While sparse autoencoders (SAEs) may offer a potential method to interpret steering vectors, recent findings show that SAE-reconstructed vectors often lack the steering properties of the original vectors. This paper investigates why directly applying SAEs to steering vectors yields misleading decompositions, identifying two reasons: (1) steering vectors fall outside the input distribution for which SAEs are designed, and (2) steering vectors can have meaningful negative projections in feature directions, which SAEs are not designed to accommodate. These limitations hinder the direct use of SAEs for interpreting steering vectors.
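
The second reason in the abstract can be illustrated with a toy sketch. It assumes a standard ReLU encoder, and for simplicity uses tied weights, zero bias, and a hypothetical two-feature orthonormal dictionary. It also assumes the steering vector is built as a difference of mean activations, which is one common construction and is not stated in the abstract. All names and numbers are illustrative and not taken from the paper.

```python
# Toy illustration only: a 2-feature "SAE" with orthonormal decoder directions.
# Every name and number below is hypothetical, not taken from the paper.

def relu(x):
    return max(0.0, x)

# Decoder directions (one row per feature); orthonormal for simplicity.
DECODER = [
    [1.0, 0.0],
    [0.0, 1.0],
]

def encode(v):
    """ReLU encoder with tied weights and zero bias: f_i = ReLU(d_i . v)."""
    return [relu(sum(d_k * v_k for d_k, v_k in zip(d, v))) for d in DECODER]

def decode(f):
    """Reconstruction as a non-negative combination: sum_i f_i * d_i."""
    dim = len(DECODER[0])
    return [sum(f[i] * DECODER[i][k] for i in range(len(DECODER)))
            for k in range(dim)]

# Assumed construction: steering vector = mean activation on "positive"
# prompts minus mean activation on "negative" prompts. The difference can
# point against a feature direction, giving a negative projection.
positive_mean = [2.0, 0.5]
negative_mean = [0.5, 1.5]
steering = [p - n for p, n in zip(positive_mean, negative_mean)]

features = encode(steering)        # the ReLU clips the negative projection to 0
reconstruction = decode(features)  # the second component is lost

print("steering:      ", steering)
print("features:      ", features)
print("reconstruction:", reconstruction)
```

In this sketch the steering vector has a projection of -1.0 onto the second feature direction, but the encoder reports an activation of 0.0 for it, so the reconstruction drops that component entirely. A real SAE also has learned biases and a much larger, non-orthogonal dictionary, so the effect there is less clean than in this toy case.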
