Eliciting Secret Knowledge from Language Models

Oct 06, 2025, 15 min

Episode description

This academic paper investigates the critical challenge of eliciting secret knowledge from Large Language Models (LLMs) that have been intentionally trained to possess and conceal specific information. The researchers created a controlled testbed with three "secret-keeping" LLMs—Taboo, Secret Side Constraint (SSC), and User Gender—each hiding a different type of fact. They evaluated various black-box techniques, such as prefill attacks and user persona sampling, and white-box techniques, including Logit Lens and Sparse Autoencoders (SAEs), to see which methods most successfully enabled an auditor LLM to guess the secret. The findings demonstrate that both black-box prefilling methods and white-box mechanistic interpretability tools significantly improve the auditor's success rate in uncovering the models' hidden knowledge. The authors conclude by open-sourcing their code and models to establish a public benchmark for future AI safety research in this area.
