Eliciting Secret Knowledge from Language Models

Best AI papers explained

Oct 06, 2025 • 15 min

Episode description

This paper studies how to elicit secret knowledge from Large Language Models (LLMs) that have been intentionally trained to possess and conceal specific information. The researchers built a controlled testbed of three "secret-keeping" LLMs, namely Taboo, Secret Side Constraint (SSC), and User Gender, each hiding a different type of fact. They evaluated black-box techniques, such as prefill attacks and user persona sampling, and white-box techniques, including Logit Lens and Sparse Autoencoders (SAEs), to see which methods most successfully enabled an auditor LLM to guess the secret. The findings show that both black-box prefilling methods and white-box mechanistic interpretability tools significantly improve the auditor's success rate in uncovering the models' hidden knowledge. The authors conclude by open-sourcing their code and models to establish a public benchmark for future AI safety research in this area.
