AI Control: Improving Safety Despite Intentional Subversion - podcast episode cover

AI Control: Improving Safety Despite Intentional Subversion

Dec 15, 202317 min
--:--
--:--
Download Metacast podcast app
Listen to this episode in Metacast mobile app
Don't just listen to podcasts. Learn from them with transcripts, summaries, and chapters for every episode. Skim, search, and bookmark insights. Learn more

Episode description

Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.We’ve released a paper, AI Control: Improving Safety Despite Intentional Subversion. This paper explores techniques that prevent AI catastrophes even if AI instances are colluding to subvert the safety techniques. In this post:

  • We summarize the paper;
  • We compare our methodology to what the one used in other safety papers.
The next post in this sequence (which we’ll release in the coming weeks) discusses what we mean by AI control and argues that it is a promising methodology for reducing risk from scheming models.

Here's the abstract of the paper:

As large language models (LLMs) become more powerful and are deployed more autonomously, it will be increasingly important to prevent them from causing harmful outcomes. Researchers have investigated a variety of safety techniques for this purpose, e.g. using models to review the outputs of other models [...]

---

First published:
December 13th, 2023

Source:
https://www.lesswrong.com/posts/d9FJHawgkiMSPjagR/ai-control-improving-safety-despite-intentional-subversion

---

Narrated by TYPE III AUDIO.

For the best experience, listen in Metacast app for iOS or Android