Identifiable Steering via Sparse Autoencoding of Multi-Concept Shifts

Best AI papers explained

May 06, 2025 • 12 min

Episode description

We introduce Sparse Shift Autoencoders (SSAEs), a method for learning to steer Large Language Models (LLMs) by manipulating their internal representations. Traditional steering techniques rely on costly supervised data in which only a single concept varies between examples. SSAEs instead learn from paired observations in which multiple, unknown concepts change at the same time. They map the difference between the two embeddings in a pair to a sparse representation whose components correspond to individual concept shifts. Sparsity regularization makes the learned steering vectors identifiable, meaning that each vector reflects a change in a single concept rather than a mixture of several. Empirical results using Llama-3.1 embeddings on various language datasets show that SSAEs achieve high identifiability and enable accurate steering, and that they remain robust as the representations become more entangled.
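To make the idea in the description concrete, here is a minimal sketch of the forward pass, loss, and steering step. It is not the paper's implementation: the linear encoder and decoder, the L1 penalty as the form of sparsity regularization, and every name used (`ssae_forward`, `ssae_loss`, `steer`, `W_enc`, `W_dec`, `l1_weight`) are assumptions made for illustration, and no training loop is shown.

```python
import numpy as np

# Hypothetical sketch: shapes, names, and the L1 penalty are assumptions,
# not details taken from the episode or the paper.

def ssae_forward(delta_z, W_enc, b_enc, W_dec):
    """Encode embedding differences into concept-shift codes, then decode.

    delta_z: (batch, d_embed) differences between paired embeddings
    W_enc:   (n_concepts, d_embed) encoder weights
    b_enc:   (n_concepts,) encoder bias
    W_dec:   (d_embed, n_concepts) decoder weights, one column per concept
    """
    codes = delta_z @ W_enc.T + b_enc   # (batch, n_concepts)
    recon = codes @ W_dec.T             # (batch, d_embed)
    return codes, recon

def ssae_loss(delta_z, codes, recon, l1_weight):
    """Reconstruction error plus an L1 penalty that favors sparse codes."""
    recon_err = np.mean(np.sum((delta_z - recon) ** 2, axis=1))
    sparsity = np.mean(np.sum(np.abs(codes), axis=1))
    return recon_err + l1_weight * sparsity

def steer(z, W_dec, concept, strength=1.0):
    """Shift an embedding along the decoder column for one concept."""
    return z + strength * W_dec[:, concept]
```

In this sketch each decoder column plays the role of a steering vector. If training succeeds in the sense described above, a code with a single nonzero entry would correspond to a change in exactly one concept, so adding the matching column to an embedding would steer that concept alone.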
