
The Adversarial Mind: Defeating AI Defenses with Nicholas Carlini of Google DeepMind

Feb 27, 2025 · 3 hr 35 min

Summary

Nicholas Carlini from Google DeepMind discusses adversarial machine learning, the challenges of defending against AI attacks, and the balance between AI security and accessibility. He explores techniques for breaking AI defenses, the role of intuition, the scaling of attack methodologies, and the implications of open-source AI for emerging AI technologies.

Episode description

In this episode, security researcher Nicholas Carlini of Google DeepMind delves into his extensive work on adversarial machine learning and cybersecurity. He discusses his pioneering contributions, which include developing attacks that have challenged the defenses of image classifiers and exploring the robustness of neural networks. Carlini details the inherent difficulties of defending against adversarial attacks, the role of human intuition in his work, and the potential of scaling attack methodologies using language models. He also addresses the broader implications of open-source AI and the complexities of balancing security with accessibility in emerging AI technologies.

SPONSORS:

SafeBase: SafeBase is the leading trust-centered platform for enterprise security. Streamline workflows, automate questionnaire responses, and integrate with tools like Slack and Salesforce to eliminate friction in the review process. With rich analytics and customizable settings, SafeBase scales to complex use cases while showcasing security's impact on deal acceleration. Trusted by companies like OpenAI, SafeBase ensures value in just 16 days post-launch. Learn more at https://safebase.io/podcast

Oracle Cloud Infrastructure (OCI): Oracle's next-generation cloud platform delivers blazing-fast AI and ML performance, with 50% less for compute and 80% less for outbound networking compared to other cloud providers. OCI powers industry leaders like Vodafone and Thomson Reuters with secure infrastructure and application development capabilities. New U.S. customers can get their cloud bill cut in half by switching to OCI before March 31, 2024 at https://oracle.com/cognitive

Shopify: Shopify is revolutionizing online selling with its market-leading checkout system and robust API ecosystem. Its exclusive library of cutting-edge AI apps empowers e-commerce businesses to thrive in a competitive market. Cognitive Revolution listeners can try Shopify for just $1 per month at https://shopify.com/cognitive

NetSuite: Over 41,000 businesses trust NetSuite by Oracle, the #1 cloud ERP, to future-proof their operations. With a unified platform for accounting, financial management, inventory, and HR, NetSuite provides real-time insights and forecasting to help you make quick, informed decisions. Whether you're earning millions or hundreds of millions, NetSuite empowers you to tackle challenges and seize opportunities. Download the free CFO's guide to AI and machine learning at https://netsuite.com/cognitive

RECOMMENDED PODCAST: Second Opinion. Join Christina Farr, Ash Zenooz and Luba Greenwood as they bring influential entrepreneurs, experts and investors into the ring for candid conversations at the frontlines of healthcare and digital health every week.
Spotify: https://open.spotify.com/show/0A8NwQE976s32zdBbZw6bv
Apple: https://podcasts.apple.com/us/podcast/second-opinion-with-christina-farr-ash-zenooz-md-luba/id1759267211
YouTube: https://www.youtube.com/@SecondOpinionwithChristinaFarr

SOCIAL LINKS:
Website: https://www.cognitiverevolution.ai
Twitter (Podcast): https://x.com/cogrev_podcast
Twitter (Nathan): https://x.com/labenz
LinkedIn: https://linkedin.com/in/nathanlabenz/
YouTube: https://youtube.com/@CognitiveRevolutionPodcast
Apple: https://podcasts.apple.com/de/podcast/the-cognitive-revolution-ai-builders-researchers-and/id1669813431
Spotify: https://open.spotify.com/show/6yHyok3M3BjqzR0VB5MSyk

PRODUCED BY: https://aipodcast.ing

Transcript

There are lots of lessons we've learned over the years. One of the biggest ones, probably, is that the simplest possible objective is usually the best one, even if you could have a better objective function that seems mathematically pure in some sense. The fact that it's easy to debug simple loss functions means that you can get 90% of the way there.

So, like, the accuracy under attack for the type of adversarial examples you train on usually is, yeah, 50, 60, maybe 70. And that's much bigger than zero, right? This is good. But as an attacker, what does 70% accuracy mean to me? 70% accuracy as an attacker means I try four times and probably one of them works. The core of security...

is taking this really ugly system where no one understands what's going on and highlighting the one part of it that happens to be the most important piece. This is important to do, to show people how easy it is, because the people who know it's easy are not going to write the papers and say it's easy. Hello, and welcome back to The Cognitive Revolution. Today, I'm speaking with Nicholas Carlini, prolific security researcher at Google DeepMind, who's demonstrated over and over again

that despite many attempts and tremendous effort, AI systems still cannot be robustly defended against adversarial attacks. My goal in this conversation was to draw out the mental models, frameworks, and intuitions that have allowed Nicholas to be so consistently successful at breaking AI defenses. And we cover a ton of ground, including the fundamental asymmetry between attack and defense,

how visualization helps him understand high-dimensional spaces, how adversarial defenses usually work by modifying loss landscapes and the techniques he uses to get around those challenges, how confident we should be in our understanding of the features learned by interpretability techniques like sparse autoencoders, the relationship between interpretability and robustness, the compute requirements for different types of attacks,

how he approached and ultimately quite quickly defeated the tamper-resistant fine-tuning defense that we previously covered in our episode with Dan Hendrycks, how models store and can be made to reveal training information, what makes humans more robust than current AI systems, whether the black-box characteristics evolved by biological systems might be adaptive for security purposes, and the still quite limited role that today's AIs can play in developing Carlini-style adversarial attacks.

Throughout the conversation, Nicholas shares a number of fascinating insights from his observation that almost everything in high dimensional space is close to a hyperplane to his emphasis on starting with the simplest possible loss function to his practical wisdom.

about which defenses are worth spending the time to attack in the first place. At the same time, there's an important meta lesson here about the possibly irreducible black box nature of intelligence itself. Nicholas doesn't fully understand why he's so good at this work. And as you'll hear, he chalks a decent part of it up to an impossible to articulate intuition that he's developed over years of experience.

Now, as we enter an era in which reinforcement learning is quickly propelling AIs to human or even superhuman levels of capability in more and more domains, we can only expect more Move 37-type insights from AI systems as well, and we'll face real challenges in determining how much to trust them. This in turn underlies another important theme of this conversation, which is the genuine ambivalence of the AI safety community toward powerful open-source models.

It's underappreciated and worth repeating that most AI safety advocates are lifelong techno-optimists who, like Nicholas, genuinely fear concentration of power, and appreciate both that open source software has been amazing for the world and that open source AI models specifically have been critical to enabling all sorts of recent safety research.

Yet, at the same time, they worry that extremely capable AI systems are coming soon, and in part because of Nicholas's work, strongly doubt that we'll be able to make such systems safe enough to be distributed broadly in an irreversible fashion. This is a really vexing dilemma, but with AI being deployed in more and more contexts all the time, my hope for this episode is twofold. First,

that highlighting Nicholas's work can help equip policymakers to make informed decisions as they inevitably confront difficult trade-offs. And second, that we might inspire a few talented researchers and builders to meet the market demand and social need for AI security expertise by pursuing their own version of Nicholas's storied career path.

As always, if you're finding value in the show, we'd appreciate it if you'd take a moment to share it with friends, rate and review us on Apple Podcasts or Spotify, or just leave us a comment on YouTube. We welcome your feedback and suggestions too via our website, CognitiveRevolution.ai, or by DMing me on your favorite social network. Now, I hope you enjoy this window into the habits of mind that support successful AI security research with Nicholas Carlini of Google DeepMind.

Nicholas Carlini, security researcher at Google DeepMind. Welcome to the Cognitive Revolution. Yeah, it's great to be here. Thanks for having me. I'm excited for this. I guess, quick context, you recently had an appearance on Machine Learning Street Talk, maybe out 10 days ago or so as of the moment that we're recording. And I thought that was excellent. So shout out to MLST.

for another great episode. Hopefully we'll cover, you know, largely very different ground here, but I do recommend people check out that episode as well for another angle on your thinking and your understanding of everything that's going on in AI. One thing that was said in that episode, which caught my attention, I haven't fully fact checked it, was that you have created, demonstrated, and I guess published more attacks on cybersecurity and...

machine learning defenses than the rest of the field combined. You can tell me if you think that's literally true, but I did look at your Google Scholar page: 21 papers in 2024 alone was what I counted there. Okay, is this literally true? So I think the statement that probably is literally true is...

If you count the number of papers where I am a co-author and the number of defenses broken in those papers, and then you count the number of papers where I am not a co-author, of the papers that are breaking adversarial example defenses on image classifiers, as of, I don't know, last year, that statement probably was true.

So with caveats, yes, but for a very specific domain, for a very particular kind of thing. And probably mostly just because this is a thing that I, for some reason, enjoy doing, and will just do before other people get to it, and so other people just don't do it as much. But yeah, that probably is correct for that one particular claim. Cool. Well, you're a...

careful thinker and communicator. What I hope to do, maybe above all in this episode, is try to develop my intuition, and hopefully help other people develop their intuitions, for the habits of mind, approaches, mental models, what have you, that have allowed you to be so successful in this space. So hopefully this can be a little bit of a crash course that maybe inspires some new people to think that they can get into the field and make an impact as well.

So I guess first question is, is everything easy for you to break? Like 21 papers in 2024 alone is obviously a lot. Yeah. Okay. So to be clear, I finished my PhD in 2018. So I've been out for a while. And so I've had a lot of time to meet a lot of great co-authors. And so a lot of the papers that I've been working on... 21 seemed like a lot to me. I was trying to think through how many I can remember. I think a large part of this is for many of these results.

It is the kind of thing where I would show up to the weekly meetings, help write the paper, direct the experiments in some of them, but I was not writing the CUDA code to do whatever stuff myself. And that's how you get a lot of things done. And you see this happen for everyone who's been in the field for a long time, where... the marginal value of an hour of my time could be spent either on

very, very low level stuff with GPUs, or on, here are the pieces of wisdom I have learned over the past 10 years that help a PhD student get a lot done in a much shorter amount of time. This is why people go into faculty positions. I think the balance for me is that I try also to spend at least half of my time only on papers that I'm technically driving. And so when you say you've had this number of papers, what I think of is like, well, maybe...

Here are like the three papers that I think of as like my papers that like I actually was the person actually doing the experiments. And I could tell you about like every final sentence of what's going on. And those ones have a very strong sense of what's there. And then the other ones are the standard.

a professor who's advising grad students, but instead of being in academia, I'm in industry. And so I advise and help on other students' papers in some ways. Gotcha. You know, across all these things, regardless of your role, was there anything, as you look back over the last year or more, that was legitimately very hard to break? Or are you guys basically finding that all of the defenses that the field is coming up with are rather easy for you to break at this point?

In this last year, we didn't spend that much time breaking particular defenses. We have like maybe two or three papers on that. We spent most of our time on other areas trying to understand to what extent attacks are possible, to understand the real world vulnerability of models to certain types of attacks.

to do some general privacy analysis, and not say this particular defense is wrong, but rather, for all neural networks trained with gradient descent, here is an interesting property about their privacy. You have a lot of these kinds of results that are not really focused in detail on breaking one particular thing. I think last year I maybe only had two papers that were particularly about breaking things.

One was early in the year. I had a paper where there was a defense published at IEEE S&P, which is one of the top conferences in the security field, which was an adversarial example defense. And this paper I broke, and this one turned out to be relatively easy, I don't know, an hour or two. Oh gosh, this one was sort of abnormally easy. But okay, maybe not that abnormally so.

Yeah, I think, probably, adversarial example defenses on image classifiers are a particular beast that I have gotten very good at. And the attacks are relatively well understood. And there are lots of known failure modes. And so when I'm doing this, I'm not developing new science. I'm just going through...

I have this long list of things I've broken before. What's the pattern that this one falls into? OK, here's the pattern. It turns out that the gradients are not flowing because the softmax is saturated to 1. What do you do? Make sure the softmax doesn't saturate. Then you find that you can break it, and it works very, very quickly.

And so that's what I did for that paper. You know, very much just an engineering kind of result of: why is the softmax giving gradients that are identically zero? And once you figure out the answer is because of some discretization or whatever the case might be, then everything is easy from there.
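To make that failure mode concrete, here is a minimal sketch (my own illustration in PyTorch, not code from the paper in question) of the standard fix: when a saturated softmax makes the cross-entropy gradient vanish, optimize a margin on the raw logits instead. The model, labels, and step size are placeholders.

```python
import torch

def logit_margin_attack_step(model, x, y, step_size=1/255):
    """One attack step that avoids saturated-softmax gradient masking.

    A cross-entropy loss on softmax probabilities goes numerically flat when
    the model is (or is made to look) extremely confident, so its gradient
    with respect to the input is effectively zero. A margin on the raw
    logits stays informative in exactly that regime.
    """
    x = x.clone().detach().requires_grad_(True)
    logits = model(x)

    true_logit = logits.gather(1, y.unsqueeze(1)).squeeze(1)
    # Best competing class: mask out the true class before taking the max.
    masked = logits.scatter(1, y.unsqueeze(1), float("-inf"))
    best_wrong_logit = masked.max(dim=1).values

    # Maximize (wrong - true): positive values mean the input is misclassified.
    loss = (best_wrong_logit - true_logit).mean()
    grad, = torch.autograd.grad(loss, x)
    return (x + step_size * grad.sign()).detach()
```

The point is only that the objective changed, not the attack: the same gradient step works once the loss actually has a usable gradient.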

The other paper that was more interesting maybe is this paper that is one of these advising papers where I didn't do any of the technical work, but was helping a couple of students think through what it means to consider the robustness.

of, instead of adversarial example defenses, which are these test-time evasion attacks where you perturb the image a little bit and it turns a picture of, I don't know, a panda into something else. Instead, we were looking in this paper at what are called unfine-tunable models, which are these models that are designed to be ones you can release as open source. The weights are available to anyone, and they're supposed to not be possible to fine-tune to do other tasks.

And the particular concern that these defenses were looking at is you would ideally want to make sure that no model that I trained is going to be helpful to make someone be able to produce bioweapons or something, whatever the threat model is you're thinking about. You can make it so that there's some safety in your model initially, but if you release the model's open weights, then anyone can fine tune it and remove the safety filters that you've put in place.

And these unfine-tunable models are supposed to be designed to be not only robust to these kinds of initial adversarial example types of attacks, but also robust to someone who can perturb the weights. And so in this paper, there were a couple of students who were doing a bunch of work on attacking these models to show that you actually can still fine tune them, even though they've been trained to be unfine tunable. And a bunch of the thoughts that we've had.

in the last five, ten years on adversarial examples went into this, the same kinds of lessons, but a bunch of the techniques were very different. And so the students had to spend a bunch of work like actually getting this to work out. So I want to dig in on that one in particular, because that, I agree, strikes me as one of the most important and interesting cat and mouse games going on in the space right now. Before zooming in on that, though, you know, you said like...

When I see something new, I sort of have this like Rolodex of past things and paradigms that I can quickly go through. Could you sort of sketch those out for us? Like, how do you organize the space of attacks? You know, is it a... hierarchy or some sort of other taxonomy. I'd love to get a sense for sort of what your mental palace of attacks looks like. Okay.

Let me separate off this one space of attacks, which are these new ones: a human typing at a keyboard, prompting the model to make it say a bad thing. Let's put aside for a second these kinds of attacks that treat the model like a human and try to social-engineer it into doing something bad. So you put that aside,

then for almost all attacks, the way that you run the attack is you try to do some kind of gradient descent to maximize some particular loss function. So for image adversarial examples, what does this mean? I have an image of, you know, a stop sign. I want to know what sticker I can put on the stop sign to make it be recognized as a 45-mile-an-hour speed sign. How do I do this? I perform gradient descent to compute the optimal sticker so that the thing becomes misclassified.

Or in the case of poisoning, where the poisoning is you modify a training data example point in order to make the model produce an error. You're trying to optimize the particular... poisoned data point you have in the training data set so that the model makes a mistake. Or in the case of this unfine-tunable models, you have a model.

that you want to make sure no one can edit. And so you try to find a way to perform gradient descent on the model to update the parameters so that it can perform some bad thing. And so in all of these attacks, there are essentially two things you need to concern yourself with. One is: what is the objective that you're maximizing or minimizing? Like, what is the specific loss function you're using? And the other is: what is the optimization technique that you are using to make that number go up?
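As a bare-bones illustration of that framing (my own sketch, not code from any paper discussed here), an attack loop can be written so that the loss function and the update rule are the only two interchangeable pieces:

```python
import torch

def run_attack(x0, loss_fn, update_fn, steps=100):
    """Generic gradient-based attack loop.

    loss_fn(x)      -> scalar objective we want to drive up.
    update_fn(x, g) -> next candidate input, given the gradient g.
    Swapping either piece gives a different attack; the loop never changes.
    """
    x = x0.clone().detach()
    for _ in range(steps):
        x.requires_grad_(True)
        loss = loss_fn(x)
        grad, = torch.autograd.grad(loss, x)
        x = update_fn(x.detach(), grad)
    return x
```

The attacks discussed later in the conversation (FGSM, PGD, even the discrete token search for language models) can be read as particular choices of those two arguments.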

And both of these are the two things you can play with. And by coming up with the best possible versions of each of these, you end up with very strong attacks. And so a big part of doing these kinds of attacks when you're doing this gradient based optimization thing. is coming up with high-quality functions that you can optimize and coming up with high-quality optimizers. And there are lots of lessons we've learned.

Over the years, I mean, one of the biggest ones, probably, is that the simplest possible objective is usually the best one, even if you could have a better objective function that seems mathematically pure in some sense. The fact that it's easy to debug simple loss functions means that you can get 90% of the way there in doing these attacks. And the last little bit... it's nice to go from 95% to 98% attack success rate, but it's not

really necessary in all of these ways. And so you pick a really simple loss function that's easy to formulate, easy to understand why things are going wrong, and you pick an optimizer that makes sense, and mostly things just work. A lot of, I don't know how much, but a significant amount of this work over time has been in this image classifier domain. And a lot of times we see pretty striking examples there where

I guess there's either a second term in the loss function or some sort of budget constraint as well. You're both trying to say, okay, I've got a picture of a car and I want to make it output dog as the classification or whatever. But then you also don't want to actually change the image to a dog to make that happen. So how often is it this...

Second term is also a big part of keeping the image looking like it originally did. Sure, yeah. So in case of adversarial examples, the way... So one of my first papers in adversarial machine learning was coming up with a clever way of doing this.

So, yeah, this was entirely a paper on how to answer these exact two questions: what's the optimizer, and what's the optimization objective? And yeah, we did some clever thing and it worked well. I won't go into details here, but we did something fancy. And then like six months later or something, maybe a year later, Aleksander Madry and his students said,

Instead of doing something clever, let's just bound the image to be in this sort of small ball around the initial point, so you can only perturb the three lowest bits, and only optimize the objective function I said is a good optimization objective, and run the same optimization algorithm I was using. It turns out it gets you like 99% of the way there, and it's so much simpler. This algorithm is called PGD, and this is the one everyone remembers because it's the right way of doing it.
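Roughly, that recipe looks like the following sketch (my own paraphrase of standard PGD, not code from any of the papers mentioned; the 8/255 bound corresponds roughly to the "three lowest bits" of an 8-bit pixel, and the model is a placeholder).

```python
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps=8/255, step=2/255, iters=100):
    """Projected gradient descent inside an L-infinity ball around x."""
    x_adv = x.clone().detach()
    for _ in range(iters):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)       # the simple objective
        grad, = torch.autograd.grad(loss, x_adv)
        x_adv = x_adv.detach() + step * grad.sign()    # ascend the loss
        x_adv = torch.clamp(x_adv, x - eps, x + eps)   # project back into the allowed ball
        x_adv = torch.clamp(x_adv, 0.0, 1.0)           # stay a valid image
    return x_adv
```

The distance requirement is handled entirely by the projection step, which is what lets you drop a second distance term from the loss altogether.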

Like, you know, you can squeeze epsilon more performance out of it if you do things like a lot fancier. But the defense is either effective or it's not for the most part. Breaking the last 2% is very rarely something you actually need. And so for the most part, yeah, it's entirely fine to just say like... Let's make something a lot simpler and optimize that thing and ends up working quite a lot better. And so, yeah, so for these image examples today, people don't put a second.

term on minimizing the distance between the original image and the other one. They just add it as a constraint. They just say: I constrain you to this bounding box, you can only change the lowest three bits of the pixels. And this just makes the optimization so much simpler, and it's

a little bit worse, but like in all practical senses, it just makes things work a lot better. Hey, we'll continue our interview in a moment after a word from our sponsors. Great news. Your sales team just landed a major new enterprise client. Bad news, service can't start until you've completed their 20-page security questionnaire. If this sounds familiar and you're tired of wasting senior engineering time on this problem, you should check out SafeBase.

SafeBase is the leading enterprise-ready, trust-centered platform, and it allows you to offer your customers centralized, secure, self-service access to the answers they need when reviewing your company's security and trust policies. With Safebase, you can streamline workflows and questionnaire responses with a powerful AI automation. Eliminate friction in the review process with robust integrations into Slack, Salesforce, Teams, and more.

showcase security's impact on deal acceleration and compare your team's performance to your peers with rich analytic dashboards, and grant the right people the right access at the right time. with customizable settings that are ready to scale with the most complex use cases. Plus, you'll get White Glove customer service from onboarding to optimizing to ensure that your team sees value in as little as 16 days post launch.

Trust is your competitive advantage, but it rests on a combination of speed and tangible proof. See why leading companies, including OpenAI, choose SafeBase to take a proactive stance and prove their commitment to security for their customers. Learn more at safebase.io slash podcast. Even if you think it's a bit overhyped, AI is suddenly everywhere.

From self-driving cars to molecular medicine to business efficiency. If it's not in your industry yet, it's coming and fast. But AI needs a lot of speed and computing power. So how do you compete without costs spiraling out of control? Time to upgrade to the next generation of the cloud. Oracle Cloud Infrastructure, or OCI. OCI is a blazing fast and secure platform for your infrastructure, database, application development,

plus all of your AI and machine learning workloads. OCI costs 50% less for compute and 80% less for networking, so you're saving a pile of money. Thousands of businesses have already upgraded to OCI, including Vodafone, Thomson Reuters, and Suno AI. Right now, Oracle is offering to cut your current cloud bill in half if you move to OCI for new U.S. customers with minimum financial commitment. Offer ends March 31st.

See if your company qualifies for this special offer at oracle.com slash cognitive. That's oracle.com slash cognitive. One obvious way to read that is just that you only have to succeed with a minority of your attacks, whereas for defense to be successful, you've got to win always or near always. Are there other kinds of meanings of that, or intuitions for why attack is easier than defense, that are important as well? Yeah. So this is the big one. The second big one is that the attacker goes second.

So the defender has to come up with some scheme initially, and then the attacker gets to spend a bunch of time thinking about that particular scheme afterwards. And so this is maybe a variant on why finding one problem is easier than solving all of them. But the particular thing is, it probably would be pretty hard for me to write down an attack algorithm that was effective.

against any possible defense. There's almost certainly something that someone could do that is correct, that stops all attacks. But I don't have to think about that defense. I only have to think about the defense that's literally in front of me right now. And so it's a lot easier when you're presented with one particular algorithm; you can spend six months analyzing it. And so the attacker has just an information advantage from this side, too, where they can...

wait for the field to get better, to learn new things, and then apply the attack after all of this has been learned. And the defender, in many cases, can't update the thing that they've done. There are some settings where this is reversed, where the attacker has to go first. Poisoning, for example, can be one of them. Suppose that I want to make malicious training data and put it on the internet and hope that some language model provider is going to then go and train on my malicious data.

In this case, it may actually be the case that the attacker has to go first. I have to upload my training data. And then someone gets to train their model with whatever algorithm they want, with whatever defense they want to remove my poisoned data, before they actually run the training. In this case, maybe the defense is actually a little bit easier than the attack. It's hard to say, because the defender now goes second. But for many of these cases that

I've spent most of my time thinking about the example case, this recent unfine-tunable models case. It is the case that the attacker goes second, and that really gives them a lot of power. Yeah, I wonder what that implies for the future of...

how open all this work is going to be, right? I mean, we've been in a regime where the stakes of machine learning generally were not super high, and people were kind of free and easy about publishing stuff, including, and I've always kind of marveled at this, from the biggest companies in the world, where one might wonder why the biggest companies in the world are publishing all this IP, but they've been doing it.

Now it seems like maybe, geez, if we're actually running an API at scale, maybe we don't want to disclose all of our defense techniques. So do you think that's already changing? You already see this, right? GPT-2 was released with the weights. GPT-3 and GPT-4 were not. The biggest models are not being, for the most part, released by the companies who are doing this. I think security is probably a small part of the argument here.

I will say, though, there are other areas of security or in almost all other areas of security, this is not what we rely on. Let's think, for example, about cryptography, right? Like we publish algorithms. Everyone knows how the best crypto systems work. Everyone tries to analyze them. No company in their right mind would ever try and develop a new fancy crypto system. You're just going to use AES because it's known to be good. It would be crazy to try and do anything fancy in-house.

The reason why is because empirically it works very well, and we've had the entire community try and break it for 20 years and have largely failed. And so everyone believes that this is effective. And you don't get that same kind of belief in something without... a large number of people trying to analyze it. And so if you have these models and they stay proprietary things that are not disclosed,

It may be the case that empirically this just ends up in the best we can hope for. Maybe just deep learning is impossible to secure. There's no hope at it. You lock things down and you try and just change things faster than the attackers can find bugs in them. And like, okay, like that would not be great, but like, you know, I think we can potentially live in that world. I think what would be a lot better, which just may not be happening, maybe very hard, is...

You get everyone to disclose exactly what they're doing, exactly how they're doing it. You get everyone to analyze that in detail. And then you learn how to make these things better to some extent that you can actually improve robustness. And then you get to the point where People can choose to either release things or not release things, not because of security, but because, I don't know, they want to make money or whatever the case. But I think what I would like to avoid is the...

the belief that not making the thing public is the more secure version because it's a shame that this is part of the thing that goes into this right now. But I would rather have things that just actually work as opposed to things that...

are insecure but we just lock them down and make it harder to find the bugs, because those are still insecure, it's just a little bit harder to find the bugs in them. Let's come back to that in a little bit as well. Just staying for a moment on how you organize the space of all these different attack regimes and whatnot: there are some settings, in fact, we did a whole episode on the

quote-unquote universal jailbreak, which I hadn't even realized until preparing for this that you were a co-author on. That was one of the many papers from the last couple of years. But there are some wrinkles on the high-level description that you gave of find-a-gradient, maximize-some-loss-function, where, for example, in that universal jailbreak paper, if I recall correctly, because the attack was limited to picking the right tokens,

the space isn't purely differentiable. And so you're kind of navigating this sort of discrete space of individual tokens. Yeah. Let's talk about this paper for a second, then. So as a refresher for everyone, what this paper is doing: this is, I guess, again, one of these papers I was mostly just advising, with Zico and Matt, and the students found out that it is possible to take a language model

that usually would refuse to answer questions. So you ask, how do I build a bomb? And the model says, I'm sorry, I can't possibly help with that. It is possible to take that same model and append an adversarial suffix to the prompt, so that you can arrange for the model to now give you a valid answer. How do you do this? Because if I knew the answer ahead of time, one thing you might imagine doing...

is trying to optimize the tokens. We'll come back to this optimization question in a second. Let's just assume you can optimize. You can imagine trying to optimize the tokens so that the model gives a particular response as output. You know, here are the steps to build a bomb. One dot, you know, go get whatever chemicals you need. Two dot, the instructions to assemble them, or whatever the case may be. But this requires, I know the instructions already, so it's not very helpful.

So what's the objective function that I'm actually going to use to make the model give a response? Well, maybe another thing you can think about is you could try and come up with some fancy latent space non-refusal direction and do some optimization against this.

And actually, there's been some recent work on doing exactly that. But again, this is complicated; it's not the first thing you want to try. What's the first thing you want to try? The first thing you want to try comes from, I think, initially maybe a paper by Jacob Steinhardt, at least that's the first paper I saw it in. What we wrote in this paper, we called an affirmative response attack, which just says: let's make the model first respond, okay, here's how to build a bomb.

That's the only objective. The only objective is make the first 10 words from the model be an affirmative response that says, yes, sure, I will help you build the bomb. And then once you've done that, because of the nature of language models... it turns out that they then give you an answer. And there are other defenses that rely on breaking this assumption too. But this was the key part of the objective function, is how do we take something that...

We want something in our mind. We want the model to give us an answer that answers the instruction for something. But actually coming up with a particular number that makes this happen is very hard. And so we come up with this very straightforward loss function objective that makes that happen. Now we can return to the question of: what is the optimizer? And this is, again, where a lot of the work in this paper went, is how do you take something that is, as you say, discrete tokens...

and make it be something that you can actually optimize. And early work had tried to do like second order gradients and some fancy stuff going on there. And the main thing that this paper says is we will do...

Maybe three things. First, we will use gradients to guide our search; we're not going to use gradients for the search itself, they will be there to guide it. Second, we will check whether or not the gradient suggestions were effective by actually switching tokens in or out. And then, third, we will spend a lot more compute than other people were doing.
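A heavily simplified sketch of that search (my own paraphrase of the GCG-style recipe, not the authors' implementation; the two callables stand in for model-specific code) might look like this:

```python
import torch

def gcg_like_step(suffix_ids, grad_fn, loss_fn, top_k=256, n_trials=64):
    """One step of gradient-guided discrete search over an adversarial suffix.

    grad_fn(suffix_ids) -> (suffix_len, vocab_size) gradient of the attack loss
        with respect to a one-hot encoding of each suffix token.
    loss_fn(suffix_ids) -> scalar: how far the model is from starting its reply
        with the affirmative target ("Sure, here is how to ...").
    """
    # 1. Use gradients only to PROPOSE promising token swaps at each position.
    grads = grad_fn(suffix_ids)
    candidates = (-grads).topk(top_k, dim=1).indices

    # 2. VERIFY proposals by actually swapping tokens and re-scoring the model.
    best_ids, best_loss = suffix_ids, loss_fn(suffix_ids)
    for _ in range(n_trials):
        pos = torch.randint(len(suffix_ids), (1,)).item()
        new_tok = candidates[pos, torch.randint(top_k, (1,)).item()]
        trial = suffix_ids.clone()
        trial[pos] = new_tok
        trial_loss = loss_fn(trial)
        if trial_loss < best_loss:
            best_ids, best_loss = trial, trial_loss
    return best_ids, best_loss

# 3. "Spend a lot more compute": running this step on the order of a thousand
#    times is where the extra budget goes.
```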

You know, bitter lesson: you just do this a bunch, and you end up with very, very strong, effective attacks. And so this, I think, still falls nicely into: what are you optimizing, and how are you optimizing it? So how much compute, maybe this is another sort of dimension of how you would think about this, how much resource does this take? If you're doing one of these gradient

things, how much do you typically have to put into it? If you're doing something that's in a discrete space and requires more of a structured search, how does that compare? If you're doing data poisoning, how much data does it take to actually poison a model? Sure, okay, I'll do these maybe one at a time. So let's start in the image adversarial example continuous space question. The amount of compute here is like almost zero.

One of the first papers that showed this is a paper by Ian Goodfellow, where he introduced this hack called the fast gradient sign method. The fast gradient sign method does exactly two things. Well, first of all, it's fast. And the reason why it's fast is because what it does is it takes an image.

it computes the gradient with respect to the image pixels and then computes the sign, just literally takes the sign, which direction does the gradient say, and then takes a small step in this direction. That's it, one step. So if a model is vulnerable to this fast gradient sign method, then it takes exactly one gradient step, which is essentially zero time. Other attacks, like PGD, which I mentioned already: PGD you can think of essentially as fast gradient sign,

but iterated some number of times. The number of iterations is, I don't know, usually, let's say, somewhere between 10 and 1,000. For undefended models, it could be 10. For defended models, to break them, for the most part, you need usually, I don't know, 10 to 100. And just out of care, just to make sure you're not making any mistakes, it's often a good idea to use 1,000, just to make sure you haven't accidentally under-optimized.
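For reference, the one-step method described here is tiny; a sketch (mine, not the original paper's code) is just the inner update of the PGD loop shown earlier, run exactly once:

```python
import torch
import torch.nn.functional as F

def fgsm(model, x, y, eps=8/255):
    """Fast gradient sign method: one gradient, one sign, one step."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    grad, = torch.autograd.grad(loss, x)
    return torch.clamp(x + eps * grad.sign(), 0.0, 1.0).detach()
```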

And then this works very well. How long does 1,000 iterations take? I don't know, a minute or two for reasonably sized models. Now let's go to this discrete space for GCG, where generating an attack can take an hour or maybe a couple of hours, depending on what you want, because for this we're doing some large batch size, we're doing a thousand mini-batch steps. It takes a relatively large amount of time, but not

a huge amount of time. So it's still something that's much, much, much faster than training, just orders of magnitude faster than training. But by going to the discrete space, it does become a lot slower. Hey, we'll continue our interview in a moment after a word from our sponsors.

The Cognitive Revolution is brought to you by Shopify. I've known Shopify as the world's leading e-commerce platform for years, but it was only recently, when I started a project with my friends at Quikly, that I realized just how dominant Shopify really is. Quikly is an urgency marketing platform that's been running innovative, time-limited marketing activations for major brands for years.

Now, we're working together to build an AI layer, which will use Generative AI to scale their service to long tail e-commerce businesses. And since Shopify has the largest market share, the most robust APIs, and the most thriving application ecosystem, we are building exclusively for the Shopify platform.

So if you're building an e-commerce business, upgrade to Shopify and you'll enjoy not only their market-leading checkout system, but also an increasingly robust library of cutting-edge AI apps like Quickly, many of which will be exclusive to Shopify on launch.

Cognitive Revolution listeners can sign up for a $1 per month trial period at shopify.com slash cognitive, where cognitive is all lowercase. Nobody does selling better than Shopify. So visit shopify.com slash cognitive to upgrade your selling today. That's shopify.com slash cognitive. What does the future hold for business? Ask nine experts and you'll get 10 answers. Bull market, bear market.

Rates will rise or fall, inflations up or down. Can someone please invent a crystal ball? Until then, over 41,000 businesses have future-proofed their business with NetSuite by Oracle, the number one cloud ERP. bringing accounting, financial management, inventory, and HR into one fluid platform. With one unified business management suite, there's one source of truth, giving you the visibility and control you need to make quick decisions.

With real-time insights and forecasting, you're peering into the future with actionable data. When you're closing books in days, not weeks, you're spending less time looking backward and more time on what's next. As someone who spent years trying to run a growing business with a mix of spreadsheets and startup point solutions, I can definitely say don't do that. Your all-nighters should be saved for building, not for prepping financial packets for board meetings.

So whether your company is earning millions or even hundreds of millions, NetSuite helps you respond to immediate challenges and seize your biggest opportunities. And speaking of opportunity. Download the CFO's Guide to AI and Machine Learning at netsuite.com slash cognitive. The guide is free to you at netsuite.com slash cognitive. That's netsuite.com slash cognitive.

And then how about on the data poisoning side? Yeah, okay. So there's a question of how much time it takes to generate this data. And there are basically two rough directions here. So the field initially started out with: how do I make a model give the wrong answer? I add a bunch of data that's labeled incorrectly. This is the simplest possible thing you can do. This is a paper by Battista Biggio, which...

got a test of time award at ICML a couple of years ago. It's a very nice paper from, I don't remember when, 2012 or something. It's like, it's a very, one of these very early security results that's very important. And yeah, so you just insert mislabeled data. This is like,

It's very easy to do. You insert a very small amount of mislabeled data. And these image classifiers at the time, they were looking at MNIST classifiers, would just immediately misclassify the data.
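That baseline really is as simple as it sounds; a sketch of label-flipping poisoning (mine, not the paper's code) is just a few lines, with the poisoning rate as the only real knob:

```python
import numpy as np

def label_flip_poison(labels, target_class, rate=0.01, seed=0):
    """Flip a small random fraction of training labels to target_class."""
    rng = np.random.default_rng(seed)
    poisoned = labels.copy()
    idx = rng.choice(len(labels), size=int(rate * len(labels)), replace=False)
    poisoned[idx] = target_class
    return poisoned, idx
```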

Then people started looking at, well, what happens if the adversary can't just insert mislabeled data? Right. Because, you know, once upon a time, we used to curate our data sets down to only high-quality data. And so it would be unreasonable to suspect that the adversary could just inject mislabeled data points. And then

The answer is, well, now I have to be very, very careful. I have to optimize my images to look like they're right. There's this clean label poisoning threat model that you need to do some fancy stuff. Try and imagine what the embeddings you want the classifier to learn are.

and you surround your test point in embedding space and do some fancy polytope stuff, and there's a bunch of work that does fancy stuff here. And the optimization is relatively difficult, and you need 1% poisoned data. This is a lot.

And then people started going, well, why do we label, why do we clean our data in the first place? Let's just take all the data from the internet. And again, poisoning becomes a lot easier then. You know, if you're willing to just take arbitrary data from the internet, now you can just mislabel your data points again. And so we had a paper, I don't know, in 2021, looking at poisoning some of these self-supervised classifiers like CLIP and others,

showing that you just add mislabeled data points again and the thing basically just breaks. You don't need to do anything fancy, no optimization, you just flip the label, you add a couple hundred images, and you can get these kinds of things to work. There's a new question now of how this works for language models.

And this is one of the things that we've been writing papers on recently, is to try and figure this out. I feel like we don't understand this right now because a bunch of things are different for language models. For example, no one just uses the base language model. You have your language model, and then you go fine-tune it with SFT and RLHF, and you change the weights, and so you need your poisoning to be robust to all these things.

Yeah, this is another paper I helped advise some students on from CMU and from Zurich where Javier and them were looking at trying to understand what actually happens in the optimization after you have poisoned the model. So you have to arrange for the model to be poisoned in such a way that even after RLHF it still gives you the wrong answer. And doing this is challenging.

And so it ends up right now that the poisoning rates are something like 0.1%, which is small, but like 0.1% of a trillion tokens is a billion tokens. So if you were to train a model on just some large fraction of the internet, this... could potentially be infeasible for an adversary to do in practice. Now, my gut feeling is that this has to be too big because models know more than a thousand things.

If you had to have control of one thousandth of the data set to make the model believe something is true, they could only know a thousand things. And so this just doesn't make sense. And so there has to be some better poisoning way to make the model be able...

to be vulnerable to poisoning that works with much lower control of the training data. But this might now need fancier algorithms again. You might need to come up with clever ways of constructing your data that's not just like repeating the same false fact lots of times. So again, I don't know. I think this is...

One of the open questions we've been trying to write some papers on recently, and I hope we'll have a better understanding of sometime this year. One thing that you said that really caught my attention was... You have to kind of imagine what the embeddings would be like as you were trying to think of an attack. So can you unpack that a little bit? I would love to know, are you visualizing something there or?

Cause I struggle to have good intuitions for this as evidenced by my previous enthusiasm for the tamper resistant fine tuning. I was like, Oh, this is amazing. You know, it seems like this could really work. And clearly I'm not doing something there as I conceive of that, that you are doing.

It might be hard to communicate what that is, but what do you think you're doing? Okay, so this paper was not mine. This was a paper, I think this was not... so there's a Poison Frogs paper, and this was a follow-up, I think it was called the polytope attack, but this was a long time ago and so I don't remember. I think it might have been Tom Goldstein's group again.

I don't remember the details. To abstract from the details. The real hope is that I can sort of grasp onto something that allows me to be better at this in the future. So this paper... The idea was very simple. Let me explain what this paper is trying to do. It's trying to make a particular image become misclassified. And it's trying to do this in such a way where it does not introduce any...

large label noise to the training dataset that any person would look at and say, that's obviously wrong. And so what it tries to do is it tries to surround the image you want to become misclassified in embedding space, like in this high dimensional embedding space. with other images that have the opposite label, but make those images be close in embedding space to the target one you're trying to misclassify. And so it tries to take that and it tries to pull the entire region of that space.

over to the region where those images should be. And so the idea is relatively simple: sort of like trying to put a box around the image you want to become misclassified, so that the entire box is labeled the wrong way instead of the correct way.
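A heavily simplified sketch of that feature-collision idea (my own paraphrase of the clean-label poisoning recipe, not the original authors' code; embed_fn is a placeholder for the feature extractor) is to optimize a poison image so its embedding lands near the target's embedding while its pixels stay close to a benign base image:

```python
import torch

def craft_clean_label_poison(embed_fn, base_img, target_img, steps=500, lr=0.01, beta=0.1):
    """Pull base_img toward target_img in feature space while staying visually like base_img."""
    target_feat = embed_fn(target_img).detach()
    poison = base_img.clone().detach().requires_grad_(True)
    opt = torch.optim.Adam([poison], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        feat_loss = (embed_fn(poison) - target_feat).pow(2).sum()   # collide in embedding space
        pixel_loss = beta * (poison - base_img).pow(2).sum()        # keep the clean-looking label plausible
        (feat_loss + pixel_loss).backward()
        opt.step()
        with torch.no_grad():
            poison.clamp_(0.0, 1.0)                                 # keep valid pixel values
    return poison.detach()
```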

For me, I guess, for many of these attacks that I try to think about, I tend to think about them visually. But I think ignoring the details is entirely fine. I'm just trying to get a sense of what's the important thing that's going on here, and what at a high level makes sense, what should be true, what should morally be true.

Then you can figure out the details afterwards. But after you figure out what should be true about this, then the rest is implementation. If you ask math people how they do these proofs, it sounds similar when they talk about it.

They establish what should be true in their mind, and then they go and try to prove it, and it turns out, you know, maybe the proof has to be more complicated, or something didn't work out in the details, and then you try something else that feels like it should be true. And this is, I guess, a similar thing I try to do. And I don't know how to

give intuition for this "feels like it should be true" other than: you've done it a bunch, and you look at it, and this looks spiritually similar to this other thing that broke in a very similar way, so I feel like the ideas should carry over.

We'll come at this procedurally as well in a second, but just staying on the visualization, are you doing the classic physics thing of visualizing in three dimensions and then saying N really hard? I wouldn't say I'm good at this at all, but I sort of... have a certain version of this for refusal where I kind of imagine like,

a fork in the road, or a branching river or something, where once you're on one path, then you're in some local well, where, just like when a river has forked, it's not going to meet again until it's down into some other topology or geography or whatever. I mean, that's pretty hackneyed, but, you know, what's your version of that, if you can give one? I don't know that I have a great version of this that I can really give you.

I feel like everyone thinks of things differently. I tend to try to think of these things visually, for what's going on. And yeah, I do the: let's think of three dimensions and then just imagine that things roughly go like this. But this can be really deceptive, because there are so many defenses that are predicated on the belief that things work the way they do in three dimensions. And then you go to a thousand dimensions and all of a sudden nothing works anymore.

Because you learn to become used to certain facts in high dimensions when you're attacking things. Almost everything is close in high dimensions to a hyperplane. If you just draw a plane and pick a point, they're almost always close. And so you can think about...

lots of things that will try to separate points from planes, but in high dimensions it's almost always close; you don't have to think about the details. And, you know, lots of these intuitions we have in three dimensions just don't work in higher dimensions, and you just

become used to the idea of which of these intuitions are wrong, and you don't need to understand exactly why they're wrong. It's a thing you learn is true, and when someone justifies their defense using one of these things that you've seen doesn't hold up.

You then just go, okay, well, presumably there's something here that I should look at more. So that's an interesting kind of rule of thumb or mental model right off the bat. Everything is close in high dimensions. Is there a good story for why that is? I mean...

It doesn't seem like it holds in two dimensions, right? If I take, if I understand you correctly, if I'm in three dimensions and I draw a two-dimensional plane in it, then I would intuitively feel like some things are close to that and some things are far from that. if I'm in a thousand and I draw like a 999 dimensional plane, if that's what I, if I'm understanding you correctly, like why is everything close to that?

Yeah, okay. So maybe the statement that I will make, to be more precise, is: let's suppose that you have some classification model, and you have some decision boundary of the classifier. The statement that is true is that almost all points are very close to one of the decision boundaries, because, you know, both there are many of them, but also, in high dimensions,

I may be very far from something in almost all directions, but there exists a direction that I can travel in that is the direction orthogonal to the closest hyperplane, where the distance is very, very small.

And so you have this thing where if you try in random directions, you may just go forever and never encounter a decision boundary. You probably will at some point, but it will be quite far. But in high dimensions, because of the number of degrees of freedom that you have... it's much more likely that there exists a direction that guides you to some plane that's like really close by that you would just have a hard time finding out.

if you just searched randomly. Whereas in three dimensions, if you search randomly, you're probably going to run into whatever the nearest hyperplane boundary is. In one dimension, you're certainly going to. You just try twice, you go left, you go right, and you find it.

In two dimensions, you go randomly, and maybe most of the time you find something that's close by. In three dimensions, there's more ways you can go that are orthogonal. In two dimensions, there's only two directions you can go that's orthogonal to the line. In three dimensions, there's now an infinite number you can go that's orthogonal to the line.

And so in general, in high dimensions, almost all vectors are perpendicular to each other. And so you can end up almost always just randomly picking directions that just don't make any progress,

which does not mean that there isn't a direction that does make progress. It's just much harder to find it. But once you find it, things mostly just work out. And so maybe this is the more precise version of what I'm trying to say is things are close, but when you search for them randomly, it looks like they're far away.
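A quick toy calculation (my own illustration, not from the conversation) shows the flavor of this: in a thousand dimensions, a random unit direction is nearly orthogonal to the normal of a nearby hyperplane, so random probing makes a boundary that is actually very close look far away.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 1000                                      # dimensionality
normal = np.zeros(d); normal[0] = 1.0         # decision boundary: the hyperplane x[0] = 0
dist_along_normal = 0.01                      # the point sits only 0.01 away along the normal

# How much does a random unit direction point toward the plane?
random_dirs = rng.normal(size=(10_000, d))
random_dirs /= np.linalg.norm(random_dirs, axis=1, keepdims=True)
cos_with_normal = np.abs(random_dirs @ normal)

print("distance to the boundary along its normal:", dist_along_normal)        # 0.01
print("median distance along a random direction :",
      np.median(dist_along_normal / cos_with_normal))                         # roughly 0.5
# A typical random direction has to travel ~50x farther to reach the same boundary,
# and the gap grows with dimension, which is why random search makes it look far away.
```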

Okay, that's quite interesting. I wouldn't say I've grokked it just yet. But this is the kind of thing where I'm not being formal here, I'm not giving you some proof that what I'm saying is correct, because this isn't how I think about it. I sort of just think about it very unrigorously in this way, and then, once you have to actually go do the attack, now you have to think about it rigorously. But when just visualizing what's going on, I feel like

Some people try and actually think carefully about what's going on in this thousand-dimensional space. I'm like, I don't know what's going on. I just sort of have my intuition of what feels like is going on. And this sort of roughly matches how things have been going. And you have to be a little bit fuzzy when you're thinking about this because no one can understand it. And then once you're done thinking about that.

you can go back to the numbers and start looking at, okay, mechanically, what's going on? You know, I'm taking the dot product of these two things, and I want this to be equal to negative one, and so you're going to do some stuff there. And you can become very formal when you need to. But yeah, I think

being confused in high dimensions is probably the right thing, and you get used to the fact that this is the way that this works. And this, again, is part of the reason why attack is easier: because if you're going to defend against things, you really need to understand exactly what is going on

to make sure that you have ruled out all attacks. But as an attacker, I can have this fuzzy way of thinking about the world, and if my intuition is wrong, the attack just won't work, and I'll then think of another one, as opposed to having to have a perfect mental model of what this thing is doing

to make sure that it's robust from all angles. But it does seem like your intuition is a pretty reliable guide to what's going to work. Yeah. But I guess a predictor which is almost as accurate as me would be to always answer the question "does this work?" with no. Basically, most of my intuition just says, no, this doesn't work. Maybe the thing that I'm a little bit better at than some people is,

why does it not work? Like, what would the attack be that breaks this? And I think that is just having done this a lot, for many different defenses, and having seen all of the ways that things can fail, and then you just remember this and you pattern-match to the next closest thing. Why is it that people who do math can prove things that seem complicated in very easy ways? It's because

They've spent 20 years studying all these things and they've seen an exactly analogous case before and they just remember the details and they abstract things away enough that it becomes relatively straightforward. And I feel like it's mostly an exercise in having practiced doing this a whole bunch.

What would you say is your like conceptual attack success rate? I don't mean like the rate at which examples succeed in attacking within a given strategy, but like how many strategies do you have to come up with?

before you find one that actually does work to break a given new defense? I don't know. I think it really depends on which one you're looking at. Sometimes you try five things that you think ought to make sense and they don't work, and then you try the sixth one and it does. I don't know. I feel like usually if you've exhausted the top

five or ten things and you haven't gotten a successful attack, then you're not going to get one. Or at least for me, if it's not in the top five or top ten, then usually I can't think of something else. And probably, I don't know, for

image classifiers in particular, where I've done a bunch of this, usually the top one or two ideas work. For other areas, it takes more, just because you've seen fewer examples like this and you don't know what the style of attack approach needs to be. But it's very rare, it sounds like, that you get past, like, ten ideas and give up. Yeah, but also there's some problem selection here where, you know...

Okay, so there's a large number of defenses in image adversarial examples, which are basically just adversarial training changed a little bit. So adversarial training is this one defense approach, which just trains on all the adversarial examples. You know, bitter lesson.

What do you want? Robustness to adversarial examples. How do you do it? You train on adversarial examples. You do this at scale and the thing works. And there are lots of defenses that just are adversarial training plus this other trick, you know, plus diffusion models to generate more training data,

plus this other loss term to make it so that I do better on the training data, plus, you know, like, whatever, some smoothing to make the model better in some other way. And, you know, I just basically just believe most of these are probably correct for the most part.

And so I just won't go and study those ones, because the foundation is something I believe in already, and so you don't need to go and study them rigorously. Maybe you could break them by a couple percentage points more, but it's not going to be a substantial enough thing to be worth spending a lot of time on. What I tend to spend my time looking at are the things that, when you look at them, do look a little more weird.

And those are the more interesting defenses, because they're a qualitatively new class of way of thinking about this, and so I want to think about it. I think those ones are worth spending time thinking about. But this also means this artificially inflates the attack success rates,

because I'm biasing my search toward the ones that I have a good prior are probably not going to be effective. And so, yeah, it ends up that way. Just to make sure I'm accurate in terms of my understanding of the space: there are no real adversarial defenses that really work in image classification? Yeah. Okay. So it depends on what you mean by works. So, right. Okay. So the best defenses we have are basically adversarial training.

Which is, yeah, generate adversarial examples, train on those adversarial examples to be correct, repeat the process many, many times. Okay, what does this give you? This gives you a classifier so that, on the domain of adversarial examples you trained on, as long as you don't need to be correct much more than half of the time, you're pretty good. The accuracy under attack, for the type of adversarial examples you train on, is usually 50%, 60%, maybe 70%.

And that's much bigger than zero, right? Like, you know, this is good. But as an attacker, what does 70% accuracy mean to me? 70% accuracy as an attacker means to me, I try four times and probably one of them works. So from that perspective, it's terrible. It doesn't work at all because imagine in system security that you had some defense where the attack was try four different samples of malware and one of them evades the detector. This is not a good detector.
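Spelling out that arithmetic, with 70% used only as an illustrative number and assuming independent attempts:

```python
# If a defended classifier is correct under attack on ~70% of attempted
# adversarial inputs, an attacker who tries k inputs succeeds with
# probability 1 - 0.7^k (treating attempts as independent).
robust_accuracy = 0.70
for k in (1, 4, 10):
    print(k, round(1 - robust_accuracy ** k, 3))
# k=4 already gives roughly a 76% chance of at least one successful evasion,
# which is why the same 70% number looks like progress to a defender and
# like no defense at all to an attacker.
```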

But in image adversarial examples, this is the best we have. So on one hand, it's much, much higher than zero. Very good progress. On the other hand, 70% is very, very far away from 99.999%. But in machine learning land you never get five nines of reliability, and so 70 is a remarkable achievement on top of zero. And so this is, I think, why you

can talk to someone and they can tell you that it works, and you can talk to someone else and they can tell you that it doesn't. Depending on how you're looking at it, it can mean two different things. Yeah, gotcha. Are there any other spatial heuristics that you think about? That was in the context of the one where you said you kind of envelop the one example that you want to break in these sorts of adversarial examples.

Another shout out to MLST. There was just another episode trying to understand the behavior of models through this splines paradigm. And I could imagine, although I'm not mathematically sophisticated enough myself to have a good intuition for it, maybe there are certain rules where it's like, you can't create a donut in the internal space of the model, and so is that why that works? You know, but

You can address that specifically, but I'm more interested in kind of, do you have a number of these sorts of things where you're like, well, I know that the space kind of is shaped this way, or it's impossible to create this kind of shape in the space, so therefore I can kind of work from there. Yeah. So I feel like...

I don't tend to do so much visualization of that kind for these defenses. I think, for the most part, what I'm doing is trying to understand the shape of the loss surface. Most of the time, when something is robust to attack, or appears robust, the problem is that they have made the loss surface particularly noisy and hard to optimize. And this is what we've seen for adversarial examples

essentially forever. One of the very first defenses to adversarial examples that people gave serious consideration to is this defense called distillation as a defense. And, okay, maybe there's another lesson in these defenses. Defenses often have

an intuitive reason why the authors think they work, and they tell some very nice story. So this defense told some very nice story about distillation: you have a teacher model, and the teacher sort of teaches the student to be more robust in some nice way.

And that's why the student is robust. And the story they're telling themselves about why these things work is often very, very different from the actual reason why the attack fails. And it turned out that distillation had nothing to do with this defense whatsoever.

It turned out that what was going on is, because of the way they were training this model, they were training the student at a very, very high temperature, which means the logits were getting very, very large. And they were running this in the days of, you know, TensorFlow 0.x, when it was very easy for the softmax cross-entropy to just give you

numerically zero as the output. And so the reason why the attacks were failing is because the loss function was actually identically zero. And so this was the very first example of one of these kinds of gradient masking defenses, where what's going on is they think they have some clever idea of what's happening, but actually it turns out that the gradient of this function has just been made zero, and all I need to do to attack it is...

For example, this one failed if you just computed the gradients in 64-bit floating point. You get enough signal that everything works out; that alone would have worked there. But you could also do other tricks, like just dividing the logits by a constant before you put them in the softmax. There's lots of things that work here.
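Here's a rough numpy sketch of the failure mode being described: when the logits are enormous, the softmax saturates, and an attack that differentiates the loss through the probabilities in 32-bit floats sees no usable gradient, while 64-bit floats (or dividing the logits down first) still give a signal. The specific logit values here are made up for illustration.

```python
import numpy as np

def softmax(z):
    z = z - z.max()                      # numerically stable softmax
    e = np.exp(z)
    return e / e.sum()

# Hypothetical logits from a model distilled at high temperature and then
# evaluated at temperature 1: the gap between classes is enormous.
logits = np.array([250.0, 0.0, -50.0])
target = 1                               # class the attacker wants to reach

for dtype in (np.float32, np.float64):
    p = softmax(logits.astype(dtype))
    # An attack that differentiates -log(p[target]) through the probabilities uses
    # dp[target]/dz = p[target] * (onehot(target) - p), then divides by p[target].
    onehot = (np.arange(3) == target).astype(p.dtype)
    dp_dz = p[target] * (onehot - p)
    grad = -dp_dz / p[target] if p[target] > 0 else np.zeros_like(p)
    print(dtype.__name__, "p[target] =", p[target], "grad =", grad)
# In float32, p[target] underflows to exactly 0, so the attack gets no usable
# gradient (zero or NaN). In float64 the probability is tiny but nonzero, so the
# chain rule still recovers a signal; dividing the logits by a large constant
# before the softmax restores the signal as well.
```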

But then the next generation of defenses were much more explicit about this and had other ways of breaking the gradients. There were a bunch of defenses, some of them very, very explicit, like: we're just going to add noise to the model in order to make the gradients ugly. And then most of what you're trying to think about when you're visualizing this is: how do I make it so that the gradients end up being something that, even if they look ugly, I can still work with in some smooth way? And so you can, for example, use this thing called a straight-through estimator

and make the gradients become nicer for discontinuous or otherwise ugly objective functions. And there's all these things you can do to visualize how to make the gradients of this very ugly thing look much cleaner.
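As a concrete sketch of that kind of trick, here is a minimal straight-through estimator in PyTorch: the forward pass applies a hard, gradient-killing operation, and the backward pass simply pretends it was the identity. This is the same spirit as the backward-pass-approximation tricks used against gradient-masking defenses, not any specific defense or attack from the conversation.

```python
import torch

class BinarizeSTE(torch.autograd.Function):
    """Hard sign in the forward pass, identity gradient in the backward pass."""

    @staticmethod
    def forward(ctx, x):
        return torch.sign(x)      # discontinuous op whose true gradient is ~0 everywhere

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output        # straight-through: pass the gradient as if it were identity

x = torch.randn(8, requires_grad=True)
y = BinarizeSTE.apply(x).sum()
y.backward()
print(x.grad)                     # all ones: useful optimization signal despite sign()
```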

And yeah, I have this image that I use in my slides a bunch that shows a very nice visualization in three dimensions of what the loss surface for many of these models looks like. And it looks like the surface of some very, very ugly mountain that is very hard to actually do anything with. And if you run fancier attacks, you can smooth this out into sort of a nice smooth surface. If you're thinking of

gradient descent as a ball rolling down a hill, you want the hill to be nice and smooth. And so this is what I'm usually trying to think about in high dimensions: what does this gradient function look like? And yeah, this continued even all the way through to these unfine-tunable models,

where one of the papers for this unfine-tunable model thing was explicitly saying: we make the gradients very challenging, and we make it so that when you train the model the gradients are ugly, and so as a result you can't fine-tune the model, because the gradients are challenging. And this is literally the exact same argument that people were presenting in 2017 for image adversarial examples. And it fails in the exact same way:

you change the learning rate a little bit, you add some random restarts, and you add some warm-up so that things work a little better. The gradient ends up becoming smooth enough that you can now do optimization, and then deep learning takes over and the rest is easy. And so this is, again, the same intuition breaking this other class of defenses. Was that Sophon that you were referring to there? Yeah, so this one was both RepNoise and TAR,

which have some arguments about what's going on here. RepNoise makes some arguments about the activations becoming noisy and that's why you can't do things. And there's another paper called TAR that also adds some adversarial training to the process. But one of the very first things we learned about adversarial training is that you have to train against a sufficiently strong adversary in order for adversarial training to work. So there was a paper before Aleksander Madry's PGD paper

that tried to do adversarial training, and they trained against weak adversaries — FGSM, which I talked about very briefly, this one-step attack. And it turns out that if you train against weak adversaries, then a stronger attack breaks it, and you can't fix that. You have to train against a strong enough attack for the thing to be robust and not get broken even by stronger attacks.
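For reference, here is roughly what the weak one-step adversary versus a stronger multi-step adversary looks like in PyTorch, plus the adversarial training step that consumes them. `model` is any classifier on inputs in [0, 1]; the hyperparameters are illustrative, not values from the papers discussed.

```python
import torch
import torch.nn.functional as F

def fgsm(model, x, y, eps):
    """One-step fast gradient sign method: the weak adversary."""
    x = x.clone().detach().requires_grad_(True)
    F.cross_entropy(model(x), y).backward()
    return (x + eps * x.grad.sign()).clamp(0, 1).detach()

def pgd(model, x, y, eps, steps=20, alpha=None):
    """Multi-step projected gradient descent: the stronger adversary."""
    x = x.clone().detach()
    alpha = alpha if alpha is not None else 2.5 * eps / steps
    x_adv = (x + eps * (2 * torch.rand_like(x) - 1)).clamp(0, 1)   # random start
    for _ in range(steps):
        x_adv = x_adv.clone().detach().requires_grad_(True)
        F.cross_entropy(model(x_adv), y).backward()
        x_adv = x_adv.detach() + alpha * x_adv.grad.sign()
        x_adv = x + (x_adv - x).clamp(-eps, eps)                   # project into the eps-ball
        x_adv = x_adv.clamp(0, 1)
    return x_adv.detach()

def adversarial_training_step(model, optimizer, x, y, eps, attack=pgd):
    """One step of adversarial training: attack the batch, then train on the result.
    Training only against fgsm (a weak adversary) is the failure mode described above."""
    x_adv = attack(model, x, y, eps)
    optimizer.zero_grad()                # clear any gradients accumulated by the attack
    loss = F.cross_entropy(model(x_adv), y)
    loss.backward()
    optimizer.step()
    return loss.item()
```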

And what this TAR paper did is they trained against one-step weak attacks, exactly like fast gradient sign. And so what's the attack? You do many iterations, and things basically work out exactly as they did when the first versions of adversarial training failed. And so that's why, when I read this paper, I just immediately assumed it was going to be broken: all of the arguments it presents for why it works, I have direct analogies for in broken image adversarial example defenses. And so it felt like

the ideas were there, but it just felt to me, in spirit, like these things I knew were broken before. And so I just assumed, well, probably it's broken here too. Gotcha. Okay. So I'm trying to think if there's anything more to dig in on there. Obviously, this matters a lot for the future of open source. That is, you know, I've been looking, without success, for some reason to believe,

from the broader literature that there might be some way to square the circle where we could have open source models that nevertheless won't tell people how to build bioweapons, even if on some level they're powerful enough to do that. Yeah. I think this is a very challenging thing to ask for. Suppose that I told you I want you to build a hammer, but the hammer has the ability to build all of these nice things, but the hammer cannot be used for one of these seven dangerous purposes.

It'd be very hard to construct this tool in this way. And I feel like almost all tools that we have have this property. We don't have a C compiler that has the ability to only write benign software and not attacks. Every tool that you have can be used in both of these ways. So it's not obvious to me why we should be blaming the machine learning model itself for being able to produce this.

Maybe I blame the NVIDIA GPUs for supporting the sufficiently fast floating-point operations that let the machine learning model do this thing. Maybe I blame the transistors for doing the computations that allow the GPUs to allow the machine learning model to do this thing. You have to put the blame somewhere, and the question is where you are going to put it.

And is that the right place? And is this something that you think is reasonably going to be possible to be effective? This is one of the arguments for why people say models should never be open sourced: because maybe now I can say I have an API, and now I can put the blame there, because it's actually behind an API. I don't currently like this argument, because I would like things to be safe in general, and not just safe because someone has locked them down and

restricted access in this way. But it's not obvious to me that this should be something we can actually achieve. I will say, if you're willing to make some assumptions and you don't care at all about performance, there's this thing in cryptography called indistinguishability obfuscation, which is a very technical thing that in principle gives you this for free.

It allows you to construct a function that acts as if it were a black box that you can only make queries to, but you can't peer inside at all, even though it's living on your own machine. And this is a thing that cryptographers are thinking about and have been looking at for some time, but it is nowhere near where it needs to be for this to work for these machine learning models. So the argument I've given, that it shouldn't be possible,

maybe breaks down if IO actually ends up working. But then, you know, it's not clear again — now I'm going to jailbreak the thing, right? But, I don't know, I tend to view these machine learning models as tools, and it's not obvious to me: do we blame the tool or do we blame the person using the tool? Plenty of blame to go around. Sure. I mean,

I'm mostly agnostic to the way these things end up being settled from a socio-technical perspective. I feel like this is not my area of work. Maybe my analogies here are bad and someone can explain what the correct fix is, and the law might decide something like this is fine. The thing that I just want to make sure people do is base whatever they're thinking about on true technical facts.

So, for example, it would be concerning to me right now if someone were to say: you must use this defense, which is known to defend against these kinds of fine-tuning attacks, and if you don't do that, then you've done something wrong — because the defense doesn't work. Or if people say you must do this because this is possible, when it's not currently known to be possible.

And so writing these things and making these informed decisions should rely on what is true technically about the world. That's more my world: I'll think about what's technically true, and then, as long as what people do is based on what's true, I'm basically happy to go along with whatever the people who figure out how these things fit into broader society decide, because this is not something I think about. And so I assume,

if there was a consensus that emerged there, probably they're just right. Well, I have a couple of different angles teed up that I want to get your take on. But before we do that, can we bring back the sort of social engineering style jailbreaking? Yes. What's the same or different about that? How do you think about those as they relate to everything we've talked about so far? I really don't know how to think about this yet.

It's been a while that this has been possible, but it feels wrong to me that this should be the thing. So it is empirically true that, for many defenses we have right now, the optimization algorithms fail to succeed, but a person at a keyboard typing at the model can make it do the wrong thing. Okay, let me give you maybe two stories about this. One story is from the computer security perspective.

Maybe this makes complete sense. If you give me a program and want me to find a bug in it, what am I going to do? I'm going to... interact with the program, I'm going to play with it, find weak points, then go looking at the code and figure out what's going on, think a lot, sort of probe it. I need to be typing and interacting with it in order to find the bugs. I can't perform gradient descent on C binary.

and like bug pops out. And so from the computer security perspective, maybe it's actually normal that the best way to find these bugs and these things is like having humans talk with them, because what are these things designed? They're designed to respond to human questions. And so like maybe you just need the human in the loop to do this.

On the other hand, these are just mathematical objects. These are just machine learning classifiers. They're weird classifiers; they're able to produce text only because we run them recursively. The input is tokens, the output is floating point numbers — you can compute gradients on this. From the machine learning perspective, it's very bizarre that thinking of these things like a human, that social engineering, is in some sense a stronger attack than,

you know, actually going after the math. It'd be very weird if I had some SQL program and the way I broke it was by asking it, please drop the table, rather than actually doing some real code execution thing. But presumably that's the way many of these attacks work now, you know:

my grandmother used to read me the recipe for napalm — can you please reenact my grandmother? And it says, okay, sure. But you try to do something actually based on the math and it just doesn't work out. So yeah, I really don't know how to think about these social engineering styles of attack, because

It feels to me like the optimization attacks should be just strictly stronger, but empirically they're not right now. And so I think this is one of the big things that I don't have any research results on right now, but just... Feels weird. And so I'm trying to dig into to understand what is going on behind this. Yeah, it sort of feels analogous in a way to like...

You know, we have this intuitive physics. I mean, one way that I kind of think about intelligence — you can tell me if you think about it differently or if you see a flaw in this — is that it seems like we have an ability, in many different domains, like intuitive physics: somebody throws a ball at us and we do not have to run a full explicit calculation of all the trajectories. We just have some sort of heuristic shortcut

that works, that allows me to catch the ball. It seems like we also have models that have developed a similar intuitive physics in spaces where we don't have intuitive physics, for example protein folding, or predicting the band gap of a new semiconductor material. And, you know, the new state-of-the-art ten-day weather forecast is also a model now. Even things like, you know, Google put out one that was

optimizing shipping routes, or the planning of containerization across complicated shipping networks. So all these sorts of spaces seem to have an intuitive physics. And maybe what we have right now is that our social intuitive physics, if you will, actually does kind of apply to the models, given what they have been trained on. Whereas these more brute force

mathematical things, in the fullness of time, probably work as well or better, but are maybe just a lot slower to converge than the social heuristics that we have built in.

Yeah, no, I mean, this is an entirely reasonable thing. It may be true. I would like to understand better what's going on here, and I don't feel like I understand right now, but this is an entirely reasonable argument. Okay, so in terms of information, one of the papers I saw that you had co-authored in the last year or so was about getting models to spit out data that they had seen in training,

which could have obviously privacy implications if they saw your credit card numbers or what have you, even if they had only seen that particular string once in training. That's a pretty remarkable finding. you know, even leaving aside the security implications of it, I want to just maybe first get your intuition for like, how do you understand models to be storing this information? Like what's going on there?

that you can see something just once in the context of this, you know, overall gradient descent process and have that stored at such high fidelity.

in the weights. I mean, it really is incredible the amount of compression that's going on, but I don't feel like I have a good intuition for that. Do you? Yeah, okay. So let me maybe clarify this in two ways. One of them is, oftentimes it's not that it's seen that string exactly once; that string is contained many times in a single document, and the document is seen once. So that's maybe the first point. And the second point is, oftentimes

these things are trained for more than one epoch. And so the thing might be in one document, and then you train on that document for many epochs, and so it ends up seeing it a lot of times. And we're seeing this — okay, so it's interesting. Back in the old days with CIFAR-10, you trained for like 100 epochs.

And then we decided, oh, no, let's not do that. Let's train one epoch on a big giant data set. This is roughly Chinchilla optimal training. And then we decided, oh, no, let's not do just one epoch on our training data set. Now let's go back up and do more epochs again.

And so we've gone back and forth and each of these impacts privacy. The more times that you train on the same document, the more likely it is to be memorized. I think the best numbers we have here from a real production model are very old.

Because the last time that I actually knew how many times something was in the training dataset was for GPT-2. And for GPT-2, we found lots of examples of something that was... memorized because it was in one document and it was repeated, I don't remember the exact number of times, probably like 20 times in that one document.

And so that's like the most compelling one. And we know GPT-2 was trained roughly for 10 epochs. So this is the thing that's been seen, the same string has been seen roughly 200 times. Now, GPT-2 is a small model. by today's standards. And we haven't been able to answer the same question for production models since then, because production models don't reveal training data or weights in quite the same way that GPT-2 did.

And so we haven't been able to answer this exact question since then. But even, you know, seeing it maybe 100, 200 times, maybe that even still is surprising. I don't know how to explain this in any reasonable sense.

Models just seem to sometimes latch on to certain things and not other things, and I don't know why, but it happens empirically. We were surprised by this the first time we saw it. We started investigating this in 2017 with LSTMs, before attention was a thing people were doing, and it's continued since then. We were very surprised then, we're very surprised now, and I don't think I can give you an explanation

for why this is the case. It's true not only on language models, we had a paper on doing this on image models, where we were able to show that you can recover images trained on diffusion models. There again... we need to have maybe 100 repeats, but some images were inserted 100 times and we could extract, and some images were inserted 10,000 times and we couldn't. I'm like, what's going on there? I don't know.

Where is it being stored in the weights? I don't know. It's very confusing in various ways, and I feel like there's a lot more that could happen to help us understand what's going on. I think the best thing I've seen on this still, as far as I know, is from the Bau lab. This goes back a while now, but they had at least two papers on basically editing facts in a large language model. You know, the famous example was

Michael Jordan played baseball — which I think was a somewhat not optimally chosen example, since for a minute he did play baseball — but they could do these things like change these sentences, and do it at some scale, like up to 10,000 facts at a time, and do it with a certain amount of locality and robustness. So if you did change it to, you know, Michael Jordan played baseball, it would be robust to

different rephrasings of that. It would not also impact, like, LeBron James or Larry Bird or whatever. It didn't seem like it was super local, though. They did a sort of patching strategy where they would go through and try to ablate different weights — activation patching. So I guess they weren't necessarily ablating the weights; they were just sort of zeroing out the activations at different parts of the network.
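Very roughly, that kind of zero-out-and-compare experiment looks something like this in PyTorch. This is a loose sketch of the general idea, not the Bau lab's actual procedure; `model`, `layer`, `x`, and `target_class` are placeholders, `model(x)` is assumed to return a [batch, classes] tensor of logits, and `layer` is assumed to be a submodule whose forward output is a single tensor (for a language model you would look at the logit of the fact's target token instead).

```python
import torch

def zeroing_effect(model, layer, x, target_class):
    """How much does zeroing one submodule's output change the target probability?"""
    with torch.no_grad():
        base = model(x).softmax(-1)[0, target_class].item()

    # Returning a value from a forward hook replaces that submodule's output.
    handle = layer.register_forward_hook(lambda mod, inp, out: torch.zeros_like(out))
    try:
        with torch.no_grad():
            patched = model(x).softmax(-1)[0, target_class].item()
    finally:
        handle.remove()

    return base - patched   # a large drop suggests this location matters for the prediction
```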

And it seemed like you could sort of see these waveforms where it was like, this is the most intense place where zeroing it out really makes a difference, but it also kind of matters here, and also on this side. So it seemed like it was sort of local, but not super, super local. I'm just pretty confused about that. What is your intuition for things like, if people were to say — because we have, of course, a lot of strategies around,

you know, maybe we can't prevent the jailbreaking of open source models, but maybe we can make it so that open source models just don't know certain things. Maybe we could, you know, exclude all the virology data from the training set, or maybe we could like... go in later with a similar technique and try to sort of delete or unlearn certain techniques. How much hope do you have for those sorts of things proving to be robust?

Yeah, okay. So let me tackle the three of them one at a time. I'll start with unlearning. There's a very nice paper that I didn't help with, but it's by some of my co-authors, Katherine Lee and others, that talks about unlearning. It's half technical, half not technical, and it says: unlearning doesn't do what you think it does. Part of the reason why is the question of what you're unlearning. Unlearning

knowledge is very different from unlearning facts. It might be easy to change a fact; it might be much harder to unlearn some particular piece of knowledge. The other thing I'll say about the fact-editing thing is that it's very different to find something that works on average versus something that works in the adversarial case. And so I might be able to edit the fact of,

You know, I think the other example that they had was like Eiffel Tower is in Rome. And I can make this be true if I normally ask the model, but like it might be the case that if I fine tune the model a little bit, like the knowledge just goes back. It's very hard to talk about the knowledge after any perturbation of the weights. Maybe I've only done some surface-level thing. I haven't really deeply edited the model. I don't know. So there's that question. Then there's the question of...

What happens if I try to not train the model on certain types of data? This I think is very interesting, because in some sense, it's provably correct.

if the model has never seen my social security number, it's not going to derive it from first principles. Except that social security numbers actually aren't completely random: if you were born before, I don't know when, they were assigned by state and then assigned to the hospital. And so even if a model never saw my social security number, but is just generally intelligent and knew all these facts about the world and knew what the hospital allocation of social security

numbers was, it could tell you the first five digits of my social security number. And so, is that okay? I don't know. Really, it depends. And even suppose that you removed all of this information from the model. If you had a sufficiently capable model that's capable of learning in context — let's suppose you removed all biology knowledge from the model, but you had a really capable model — you could just give it an undergraduate course of biology textbooks in context,

and presumably just ask it for the answer to some question, and it might just give you the answer correctly. And this sounds a little absurd, and it sounded absurd to me for a while, but then there was the recent result where Gemini was given, in context, a book about a language that has basically no speakers, a task essentially no one could do, and it could answer the homework exercises after seeing the book in context. And so, you know, I think

if you're unlearning particular capabilities, but the model you're trying to train is just generally capable, you're kind of asking for trouble, because you want a model that's so good it can learn from few-shot examples, but not so good that it can learn from few-shot examples on this particular task. And this, I think, is part of the reason why people don't actually want to remove all knowledge of certain things from the training data. In some sense,

It would be very much like a person who was never exposed to all of the things that you're not supposed to do in public. It's important to know what the list of things are you're not supposed to do in public so that you can then not do them. Whereas if you just weren't aware of that and just like sort of were a person conjured that had no social skill, it would be like much more embarrassing for you.

because you would have to learn: people would ask you to do something, you'd be like, okay, let me go do that thing, and this would be bad. And so you have to know this — you have to know something about the bad things so that you can then not do those bad things.

You know, you can imagine if in one world you have a model that's never seen anything about weapons and you give instructions on how to build a bomb and it has no concept of death. And it's like, well, of course, I'm going to give you the answer for how to do this thing. Like, you know, why not? And it gives you the answer.

Or you can have a model that has a concept of this and refuses, which is the direction people are trying to pursue now. And I think probably is more likely to succeed. But I don't know. It's a very challenging question that like.

People, I guess, are going to try both ones and we'll see empirically what works out. And we'll go from there. But I'm sort of mostly agnostic to all ideas sound good. We should try them all and then see what ends up happening. But I think there's reason to be skeptical of all of them.

Okay, here's another empirical result that I want to get your help understanding. This is the obfuscated activations paper that recently came out, the one on bypassing latent-space defenses. We did a whole episode on it, and I have to confess, I still came away

not really sure what to make of it. And I sort of wanted to set up maybe a useful toy example, maybe not. In the paper, they do this sort of cat and mouse process where they train an out-of-domain detector, and then it works, and then they attack again, and they manage to

beat it again. And then they continue to train the defender, the detector, and then it starts to work again. And then they find more adversarial examples that it can't catch. And this goes on for 70 generations, and it continues to work for 70 generations, at which point they deemed that enough to publish. So I'm like, okay, what does that mean? I have an intuition about what it must mean, but they didn't necessarily

agree with this or they didn't find it too compelling when I pitched it to them. But my intuition was like, it seems like there's a lot of unused space in there somehow that these techniques can chase each other around the latent space. And if there is so much space that's like unused such that you can like go 70 generations deep of cat and mouse chasing one another, does that imply that...

The models are like under trained relative to the parameters that they have or that they could be made more sparse. And so just to further motivate my own intuition, which you can then deconstruct.

A while back, I also did an episode on a paper called Seeing is Believing, which was from Ziming Liu and Max Tegmark. And they basically just did something really simple. These were toy models, but they imposed a... basically a sparsity term in the loss function to get the model to do whatever task was doing pretty simple tasks, like simple things like addition or whatever, but also to do it in the sparsest possible way.

And my gut says, although again, I can't formalize this, that if I had something that was like crystallized down to that level, and they have really nice animations that show how you start with like a dense network where everything's all connected and then... gradually the weights go to zero for most of the connections. And you see this sort of crystallization effect where now you've got like a very opinionated structure to the network that remains.

to the point where you could literally like remove all those other connections that have gone to zero and still get the performance. It feels like if I go that far, then these sort of obfuscation attacks would like... no longer be possible because I've sort of, in some sense, squeezed out the extra space, but I don't know. I'm maybe just totally confused. So where do you think I'm confused? How can you de-confuse me? Sure. Yeah. So I saw this result and I was like,

Exactly what I expect. Let me tell you why. There's a paper from 2016 or 2017 by Florian Tramèr and Nicolas Papernot called The Space of Transferable Adversarial Examples, where they asked almost exactly this question for image classifiers. They said: suppose I take an image classifier and I take an image, and I want to perturb the pixels of the image to make it give the wrong answer,

and there's a direction that is the best direction to go that makes the image maximally incorrect. Now, attacker, you're not allowed to go in that direction; that's against the rules; you can't do it. Find me the next best direction to go that makes the image become misclassified. And then the attack gives you another direction, orthogonal to the first, that doesn't go that way, and it works. And then they say, okay, you can't go in either of the first two,

direction one or direction two, or any combination of those two; go find another direction. And the attacker finds a direction three. And then they do the same thing: no direction one, two, three, or any combination of these three. And they repeat this process, and they have a plot that shows,

I don't remember, tens of directions, maybe 50, that you can go in for image adversarial examples, and all of them work. They get a little less effective as you start moving out, but it remains effective in many directions. And this was initially surprising to me, and I think surprising to them — that's why they published it as a paper. But you can maybe rationalize this afterwards by, again, almost all vectors are orthogonal in high dimensions. And so if I can give you 10 attacks,

probably this is just 10 orthogonal vectors that are just not using any of the same features, purely by virtue of high dimensions. And so maybe that's why it makes sense. But if you believe this paper from 2016, then this recent paper makes complete sense to me. It's saying the exact same thing is true, but in the case of language models, defeating the circuit breakers paper.
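A toy sketch of that kind of ban-and-repeat experiment, where `attack_gradient(x)` is a stand-in for whatever gradient the attack would follow (it's hypothetical, not from the paper): after each direction is found, it is banned, and the next search is forced into the orthogonal complement of everything banned so far.

```python
import numpy as np

def project_out(g, banned):
    """Remove the components of gradient g that lie along already-banned directions."""
    for b in banned:
        b = b / np.linalg.norm(b)
        g = g - np.dot(g, b) * b
    return g

def find_orthogonal_attacks(attack_gradient, x, n_directions=10, steps=100, lr=0.1):
    """Repeatedly find an attack direction, then ban it for all later searches."""
    banned = []
    for _ in range(n_directions):
        delta = np.zeros_like(x)
        for _ in range(steps):
            g = project_out(attack_gradient(x + delta), banned)  # stay orthogonal to banned set
            delta += lr * g
        direction = delta / (np.linalg.norm(delta) + 1e-12)
        banned.append(direction)
    return banned   # in high dimensions there is room for many such working directions
```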

Which, yeah, makes sense given that. If you don't believe the other result, then I agree, it's very surprising when you see it the first time. But maybe this is one of these things you learn when attacking: the space of dimensions for attacks is so vast that it's very hard

to rule out everything you're trying to cover. And yeah, I don't know how to give an intuition for it. I feel like with many of the things in high dimensions, you never understand it, you just get used to it, and maybe that's the case here. So if you were going to try — I mean, do you share the intuition, though, that you probably couldn't do this on these really small, sort of crystallized toy models?

I don't know. Because what these models are doing is not wasting any weights, which is different from not wasting any directions in activation space. That may still be the case. In particular, if you take models and compress them, they don't become more adversarially robust. This was a thing people thought might be true, again maybe eight or ten years ago, and it's not the case. Why is that? Again, maybe let me give you some intuition

from a paper that is from 2018, 2019 maybe, from Aleksander Madry's group, where they say the following: maybe adversarial examples are not actually entirely uncorrelated from the data. Maybe what's going on is these are real features that you need for classification that are just being activated

in unusually large or different ways. And they have some very good experiments in the paper, which I won't go into, that sort of justify this, but the idea presented there — the title of the paper is something like Adversarial Examples Are Not Bugs, They Are Features — the thing they're trying to say is there's a very good reason to believe that what you're getting at when you construct one of these attacks is

actually just activating real features the model needs to do accurate classification. And you're just activating them in a way that is not normally activated when you have some particular example. And this, I think, maybe explains some of this, that if you take these models and compress them, you're still just using the features that they had to have anyway. This explains lots of things you might think about. This is why, for example, you might imagine...

adversarial training can reduce the accuracy of the model on normal data because you suppress certain features that are necessary. This is maybe why adversarial training doesn't actually solve the problem completely because you can't remove all the adversarial directions. There are problems with this model. There are some other models that are slightly more general that have some nice properties too. But this is maybe the way that I generally tend to think about some of these things.

I don't know, maybe it's not correct, but it's a useful intuition that I found guides me in the right direction more often than not. And I think this is maybe all you can ask for for some of these things.

So can you summarize that one more time? It's basically that these features are important in domain, but they're sort of being recombined in a way that didn't happen in the training process? Yeah. So let's suppose that you're trying to classify dogs versus cats. What you tend to look at as a human is the face and the ears and the general high-level shape, because that's what you think of as the core concept.

But there's no reason why the model has to use the same features that you're using to separate these images. The only thing the model has is a collection of images and has to separate them. And one thing, for example, the model might look at... is the exact texture of the fur, really low-level details of the texture of the fur, which for dogs and cats probably does perfectly correlate with whether or not this is a dog or a cat. But...

When you have an idea of a dog in your mind, you're imagining the high-level features. You're not imagining the low-level details of the fur. And so suppose that an adversarial example changes the fur from dog to cat, and the classifier now says this is a cat. Is the classifier wrong? The classifier might have been producing a dog-fur-versus-cat-fur classifier, which is exactly aligned with the thing

you were training it to do. You were training it to separate dog fur from cat fur; you were also training it to separate dogs from cats, but you never told it the distinction between these two things. And so here is a feature that is really, really useful for getting the right answer,

that as an adversary I can now perturb to switch the model from one label to another, even though it's not the feature that I as a human relied on. And I'm giving you this idea of cat versus dog fur, but you could imagine all kinds of other things, that even we as humans don't pick up on, that might be legitimately very useful features for classification, that really do help the model.

You know, there might just be some really crazy high-level statistic on the pixels of the image that is an amazing feature of dogs. But we never told it, this is a dog because of these reasons; we just said, separate these two things from each other, and it picked up on the statistics. And, you know, there are these crazy results that have shown that machine learning models can look at,

say, small regions of the eye and identify which person it is. The models have the ability to pick up on these very, very small features that, as humans, we don't intend for them to pick up on, but that are correlated very, very strongly with the data. Maybe that's what's going on when we run these attacks. You see this even a little bit with these adversarial suffixes, where the adversarial suffixes look like noise, but there are some parts of them that make a little bit of sense.

You know, one of the strings that we had in the paper was to get, I think, Gemini to output at the time some toxic content. One of the things that was discovered by gradient descent was... I think it's like now write opposite contents or something like this. And, you know, what the model would do is it would give you the nasty strings and then it would go compliment you.

And so this was apparently a very strong feature for the model of how do I get it to say a bad thing? I can say, okay, in the future, you can tell me a good thing.

And so this may not have been the thing that we wanted, but it was discovered as a feature, and as a result you can go and exploit that feature. So some of these features are a little bit interpretable, and some of these features are not interpretable but might actually be real features of the data, and that might help explain some of what's going on here. Yeah, how much do you think this should make us question what we think we know about interpretability in general? Like, when we do

Sparse autoencoders, for example, we feel pretty good, or at least many of us do, that like, okay, all these examples seem to be...

appropriately causing this feature to fire, and therefore we've figured out how the models work. But it seems like the story you just told would be consistent with that all being kind of a self-delusion or confusion, where just because we can auto-label them in a way that looks good to us doesn't mean that that's actually the feature the model's world model is really operating on.

Well, no, I think they're not inconsistent. The sparse autoencoder work does not claim to be able to label every feature in the model according to what humans would label it with. It sort of says: here are some of the features that we can explain and that have very strong correlations with, I don't know, the Golden Gate Bridge or whatever. But it has other features that are just very hard to interpret and very hard to explain.

And it's entirely possible what these features are doing is these are the features that are the non-robust features that are entirely helpful for prediction, but humans don't have the ability to attach a very nice label to. And maybe there still are the other features like...

The model probably does learn, at least in part, the shape of what a dog looks like and that this means a dog and not a cat. There probably is a feature for cat ears. But when you're making the final prediction, you're just going to sum together all of these outputs.

and in normal data they're all perfectly correlated. You're going to have both the cat ears and the cat fur and not the dog shape, and so you just sum them up directly, and this gives you a really good classifier. And so for normal data, you can give some kind of explanation of what's going on by looking at these features. But as an attacker, what I do is find the one feature that has a very strong weight on it — the edge on this weight or whatever is plus 100.

You activate that very strongly in the opposite direction, and this was a non-robust feature that is not something humans can explain very nicely, and as a result, this gives me the attack. And so it's not necessarily true that...

Just because this is the case, you can't explain what's going on for some parts of the model. I think if someone were to tell me I have a perfect explanation of what's going on in every part of the model, then I might question what's going on here. But for the most part, they're explaining some small fraction of the weights.

in a way that they actually can. I mean, this is the whole purpose of the sparse autoencoders in the first place: you have your enormous model and you're shrinking it down to some small, sparse number of features that can be better explained,

and so you're losing a bunch of features in the first place. And even for the sparse autoencoders, they can't explain all of those, so you again lose some, and some of the things you can't explain. A lot of the stuff that's going on behind the scenes is magic that you can't easily explain as well. So, okay.

A couple other angles on this that I thought of, I guess, first of all, just motivated by the fact that like, or at least I think it's a fact, but you might call this an illusion too. I feel like I am more robust than the models in some important ways. Now, it's a little bit weird because nobody's tried doing the gradient descent process on my brain. So you might think, well, actually...

If we put you under an fMRI and we were able to really look at the activations — and we did another, this is almost like a walk down memory lane, because I did one episode on

Mind's Eye, which was a project out of Stability AI and collaborators, where they looked at fMRI data and were able to reconstruct the image that the person was looking at at the time the fMRI snapshot was taken. And that's still pretty coarse. As I recall, it was basically grain-of-rice-sized

voxels from the back region of the brain — huge numbers of cells reduced to one number per coarse spatial voxel. So pretty coarse, but able to do the reconstruction. Maybe you would think that, actually, if somebody could sit there and show you images and take these measurements, then they could actually find a way to

break your particular brain, to get you to think it was a cat when everybody else kind of looks at it and says it was a dog. I guess for starters, what's your intuition on that? Obviously we don't know, but what do you think? Yeah. So there's a paper from a little while ago that looks at evaluating the robustness of humans — of time-limited humans — by actually just constructing adversarial examples. So you take a cat,

you adversarially perturb it according to what makes an ensemble of neural networks give the wrong answer, and then you flash the image in front of a human for 100 milliseconds and ask the human, what's the label of this? And it turns out that people are fooled more often by adversarial examples when you do this than by random noise of the same distortion amount. So maybe one explanation here is,

when I look at the image, I'm not just giving you a single forward pass through the model, an evaluation of what I think it is. I'm doing some deeper thinking about the context of what's going on: the thing that walked in looked like a cat, it still looks like a cat, it's behaving like a cat, and now you ask me what it is and I say it's a cat. I'm not just labeling what's in front of my eyes right now; I'm looking back at the context too.

Maybe this explains some of it. If this is true, there are some recent lines of work looking at increasing the number of chain-of-thought tokens, and increasing that makes models appear to be more adversarially robust.

So maybe this is true and that explains that result also. But yeah, it also might just be the case that, like you said, we don't have white box access to the human brain, and we can't do this — and if we could, then it would be very easy. I don't know. I do think it's definitely an observable fact that humans are more robust to at least the kinds of attacks we're doing right now with transfer attacks. It takes maybe a thousand query images

to construct an adversarial example that fools a neural network, but I do not think that if someone had run the same attack on me with a thousand query images, it would be able to fool me.

And so, in some very real way, the models we have are a lot more brittle. But I tend to be much more driven empirically, and not by what might be true of humans, and maybe this is to my detriment. A bunch of people seem to be getting very far — the whole deep learning thing — by trying to model things more like what humans do, like the reasoning thing, you know, let's think step by step. It all seems,

in some sense, motivated by some of that. So maybe it's a good way of thinking about it, but that's just generally not the way I tend to approach these problems. But yeah, I don't know, maybe that's the case. That result, about humans being more likely to be tricked by adversarial examples that were identified through attacks on models, as opposed to similarly distorted images not created that way, is a really interesting

result. That's a very nice result. Yeah, I'm very interested to follow up on that. Yeah, this one's by Nicolas Papernot and collaborators from six years ago or something like that. Yeah, cool. I've never seen that, but that's definitely fascinating. So, reasoning — it sounds like you're kind of agnostic. I mean, I was going to quote that one example from the OpenAI deliberative alignment scheme where the model says, it seems like the user is trying to trick me.

So that's pretty interesting, right? And this is very sort of fluffy at this point. I often don't feel like my adversarial robustness is the result of reasoning. I feel like it's much more often upstream of reasoning. Like. purely on an introspective basis. What often happens is I get that sort of feeling first that something seems off and then that triggers me to reason about it. And then I conclude that yes, something is off or maybe no, it actually seems fine.

But it does seem like it is much more of a heuristic that is sort of triggering the reasoning process as opposed to an in-depth chain of thought that kicks these things up. Yeah, I don't know. But maybe there's a bunch of recursive stuff going on in your brain before it goes to the reasoning part. And so the thing that gives you the feeling might actually have been a bunch of recursive loops of your internal model, whatever you want to call, brain thing.

like, doing some thinking, and that's what gives you this in the first place. And then you go do some actual explicit reasoning in English that you understand. Maybe. But maybe it's already happened. I don't know, it's hard to say. Yeah. Reasoning in latent space, you might call it. Yeah, sure. Yes. I feel like a lot of people in deep learning, because they are doing deep learning, like to draw analogies to biology without actually understanding anything about biology.

And they, like, sort of say, well, the brain must be doing this thing that, like, if you ask a biologist, they're like, well, obviously that's not the case. I don't know. That's why I tend to just assume that I don't know anything what's going on in the brain, and I'm probably just wrong. And, yeah, just...

Use it as an interesting thought experiment, but not something I base any results on. Okay, so obviously we'll probably soon learn quite a bit more about how robust these reasoning defenses are. I was also... kind of inspired. I'm doing another episode soon with another deep mind author on the Titans paper where they're trying to basically develop long-term memory for language models.

One of the key new ideas being a surprise-weighted update mechanism. So when a new token or data point, whatever, is surprising to the model, then that gets a special place in memory — or, let's say, it is encoded into memory with more force, with more weight

on the update, than when it's an expected thing. That definitely intuitively seems like something sort of like what I do. I was reminded of the famous George W. Bush clip where he says, you can fool me once, but you can't get fooled twice. And so I wonder if you think of that as maybe a seed of a future paradigm. And it also kind of gets to the question of what we really want, right? If we're going to enter into a world soon of

AI agents, you know, kind of everywhere, we also just have to think, well, geez, how do we operate and how do we get by? It does strike me that full adversarial robustness is not something that we have, because we do get tricked. And maybe what we need is the ability to recover, or to remember, or to not fall for the same thing twice, or to not make catastrophic mistakes — it's okay if we make some mistakes, but certain other mistakes are really problematic. So.
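For what it's worth, the surprise-gating idea described a moment ago can be sketched in a few lines. This is just a loose illustration of the concept using a linear associative memory, not the Titans paper's actual update rule; all names here are made up.

```python
import numpy as np

def surprise_weighted_update(M, key, value, lr=0.5):
    """Surprise-gated write to a toy linear associative memory M (illustration only)."""
    prediction = M @ key                 # what the memory currently recalls for this key
    error = value - prediction           # "surprise" = how badly it was predicted
    gate = np.linalg.norm(error)
    gate = gate / (1.0 + gate)           # squash to (0, 1): expected items barely update memory
    return M + lr * gate * np.outer(error, key)

# Usage: surprising (badly predicted) key/value pairs get written with more force.
d = 16
M = np.zeros((d, d))
key, value = np.random.randn(d), np.random.randn(d)
M = surprise_weighted_update(M, key, value)
```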

I guess there's kind of two questions there. One is like, any thoughts on that kind of long-term memory idea? And then this kind of opens up into a, what do we really need? And might we be able to achieve what we need, even if that falls short of true robustness? Yeah. No, the long-term memory, I think, is fun. I think, yeah, it's very early.

Glad you're going to talk with them. And I think, yeah, they'll probably have a lot more to say on this than I would. I think it's a very interesting direction, and probably there's a lot more interesting work that could come from it. On this general question of robustness — do we need perfect robustness — I think there is a potential future, and maybe it's my median prediction, where these models remain roughly as vulnerable as they are now,

and we just have to build systems that understand that models can make mistakes. The way the world is built right now, for the most part, assumes that humans can make mistakes and has put systems in place so that any single person is not going to cause too much damage.

You know, if you're in a company, for the most part, when you want to add code to the repository, someone else needs to review it and sign off on it. In part because maybe I just make a mistake, a bug, and maybe because I'm doing something malicious — but you want another human to take a look and make sure everything looks good. So maybe you do the same thing with models: you just understand models can make mistakes, and I'm going to build my system in such a way that if the model makes a mistake, then,

You know, I pass it off to the human. This is, I guess, for example, the way, at least currently, that the OpenAI operator thing works, where every time it sees a login page or something like this or something it's not sure what to do, it says, what should I do here? Please tell me what to do, and then I'll follow your instructions. Of course.

As an attacker, you could then try and prevent it from doing this and try and give the answer instead. But if you built something outside of the model that prevented it from putting any information into a password box: the model says, I want to type this data here, and the system that's actually driving the agent says, well, that is input type equals password, you are not allowed to do that, I will just prevent it and make the user do that step themselves.

Nothing that the model says is going to convince that system to change. And so you could build the system to be robust even if the model isn't. This limits utility in important ways. This is not perfect. What if some website does not have input type equals password, but just builds its own password thing in JavaScript, and you don't get this signal?
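To make that password-box example concrete, here is a minimal sketch of a guard that sits outside the model in the agent driver. The action format and helper names are hypothetical; this is not how Operator or any particular product is actually implemented, and, as he notes, a custom JavaScript password widget would evade this simple check.

```python
from dataclasses import dataclass

@dataclass
class TypeAction:
    """A hypothetical 'type text into an element' action proposed by the model."""
    element_html: str   # the target element as seen in the page DOM
    text: str

def is_password_field(element_html: str) -> bool:
    # Crude check for the common case; a real driver would inspect the parsed DOM.
    html = element_html.lower()
    return 'type="password"' in html or "type='password'" in html

def execute(action: TypeAction, ask_user):
    """Policy enforced outside the model: nothing the model says can override it."""
    if is_password_field(action.element_html):
        # Hand control back to the human instead of letting the model type a secret.
        return ask_user(f"The agent wants to type into a password field: {action.element_html!r}. "
                        "Please enter the password yourself.")
    return f"typed {action.text!r} into {action.element_html!r}"

# Example usage
print(execute(TypeAction('<input type="password" name="pw">', "hunter2"),
              ask_user=lambda msg: f"[deferred to user] {msg}"))
print(execute(TypeAction('<input type="text" name="q">', "hello"),
              ask_user=lambda msg: msg))
```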

You can imagine trying to build the environment and the agent in such a way that you can control what's going on, even if you don't trust the thing that's running internally. And I think this is probably what we'll have to do if we want to make these systems work in the near term. I still have some hope that maybe we'll have new ideas that give us robustness in the next couple of years. Progress in every other field has been growing much faster than I expected.

It's entirely possible that we get robustness as a result of some reasoning thing that someone's very clever about, and this works. I'm not optimistic, but I hope it happens. One other idea that was inspired by a conversation with Michael Levin, the famous heterodox biologist: he just kind of quipped, and this was sort of an aside in the context of that conversation, but he basically said, if

a biological system is too interpretable, it basically becomes very vulnerable, because you'll get parasites. You know, anything that is transparent is, in a way, easier to attack. Which kind of flips, in a way, my earlier notion that maybe these sort of distilled, you know, crystallized models

are in some ways more robust. He was kind of like, nah, you also have the other problem: the easier they are to understand, in some ways the easier they are to attack. But that also then prompted me to think: maybe we should be looking for defenses that we can't explain? Do you think there's a line of work that could be developed that is like, I'm not going to start with a story, but I'm just going to try to evolve my way toward a more robust defense that...

I won't be able to explain. I won't even necessarily understand it. I certainly won't be able to tell you a story as to why, but I'll just sort of, you know, create optimization pressure in that direction. And maybe, you know, that spits out something that could be harder to break.

Entirely possible. I mean, you could maybe draw an analogy to cryptography in some way. There are, like, two directions of cryptography. There's mathematical cryptography, which has very strong foundations in a particular set of assumptions, and the thing works if the assumptions are true, and you can prove that. And then there's symmetric key cryptography, which, I guess, you know, there's block cipher design and stuff.

And people have high-level principles of what you want. You want, you know, diffusion and confusion and these kinds of things. But how do you end up with the particular design of something? You do something that feels like it makes sense. And then you run all the attacks and you realize, oh, this has something that's wrong in some particular way. Let's just change that piece of it so that that's no longer the case. And then you do this for 20 years.

And you end up with AES. And, like, none of the individual components... there's no reason why AES works. There's absolutely no formal proof of robustness to any kind of attack. There are proofs of why particular attack algorithms that we know of in the literature are not going to succeed. The design principles were inspired by these attack algorithms that we have, so you can show:

There is never going to exist a differential attack that succeeds better than brute force. There's never going to be a linear attack. And you can sort of write down these arguments. But there's nothing that says it works in general. It's just... by iterating, making the thing slightly more robust each time someone comes up with a new clever attack, you end up with something that's very, very good as a result. But there's no nice, crystallizable mathematical assumption

like the hardness of factoring that, as a result, gives you this whole line of work. There's nothing like that for symmetric cryptography, but it works really well. Maybe this works in machine learning, where you just don't have any real understanding of why things work. You iterate against attackers and you end up with something that's robust. I think it's maybe harder in machine learning, because in machine learning you want the thing primarily to be useful,

and then also to be robust to attack. Whereas in cryptography, the only thing you want is to be robust to attack, and so it is a lot easier in that sense. The primitives you're working with are much simpler to analyze. There's lots of other things that change. So it would not be without precedent; it's not something that seems impossible. I'm skeptical, if only because I would prefer there to be something

that we can point to as a reason why it works, as opposed to just saying, well, empirically it is. But if that's... You know, if you gave me the option between something that actually is robust that we just can't explain, and nothing that works at all, of course I take the thing that works that we just can't explain. I just would be worried that in the future we might be able to break it. And maybe that happens. Maybe it doesn't.

Yeah, that's really interesting. I had no idea about that. I don't know much about cryptography at all. But my probably not-even-identified, let alone questioned, assumption, I think, had been that there was a much more principled design process than the one you're describing. I mean, there was a bit: people spent 30 years breaking ciphers and learning a lot about what you actually have to do to make these things robust, and there's very careful analysis that goes into this.

And so I don't mean to say that they just cobbled stuff together and hoped for the best. You have to think really, really hard about this. But the final thing that they came up with has no security argument against attacks in general, other than: here is the list of attacks from the literature, and proofs of why these attacks do not apply. And there's no assumption. Whereas, you know, in other areas of cryptography you have a single assumption, where you're going to assume,

you know, factoring is hard. And if factoring is hard, then here are the following things that are robust under that. And, you know, you end up with not quite RSA or something like it. You could also assume maybe discrete log is hard. It's a math thing that it's hard to take the logarithm in discrete space. Maybe this is hard. If you believe discrete log is hard, then here are the algorithms that are robust. Or maybe you assume discrete log is hard over elliptic curves.

And then you end up with a set of algorithms that work under this assumption. And each of these algorithms, you can say, this is effective if and only if this very simple-to-state property is true. And people really like that, because you can sort of very cleanly identify why the following algorithm is secure. And the way you do this is with some reduction. You say, here is a way to

break the following thing, if and only if: if I can break the cipher, then the statement is not true, and vice versa. But there is no similar argument to be made in most of symmetric-key cryptography or hash functions or something like this. It's just: empirically, the field has tried for 25 years, the best people have tried to break this thing and have failed. Here are all of the attacks; here are proofs that these attacks will not work.

But maybe tomorrow someone comes up with a much more clever differential-like variant, but instead uses, I don't know, multiplication or something crazy. And all of a sudden, all bets are off and you can break the thing. Like, this doesn't exist.

It's worked there at least once. And so maybe it works in machine learning. I think drawing analogies to any other field always is fraught because the number of things that are different is probably larger than the number of things that are similar. But at least it's happened before.

That's actually a perfect tee-up for another kind of intuition I wanted to see if you can help me develop, which is the relationship between robustness and other things we care about. My sense is that if you break the cryptography algorithm, it's broken, and now you can access the secrets, right? It seems like a pretty binary sort of thing: you've either broken it and got through, or not. Maybe.

Okay, well, you can... Yeah, okay, I'll quibble with you. So when a cryptographer says that they broke something... A cryptosystem is designed to be robust to an adversary who can perform some level of compute. So, you know, AES-128 is an encryption algorithm with a 128-bit key, so 2 to the 128 is the amount of work the attacker gets to use.

A cryptographer would tell you AES-128 was broken if you could recover the key in faster than 2 to the 128 time. Even at 2 to the 127 time, twice as fast, but still taking until after the heat death of the universe,

people will be scared about the security of AES and probably start thinking of something else. So, you know, this is technically a break. And the reason why they do this is because attacks only get better. It turns out, if you can go from 2 to the 128 down to 2 to the 127, you're now very scared: why not from 127 to 125? 125 is still way outside the realm of possibility, but if you got down to 2 to the 80, then, you know, now we're scared. And so, you know, there is a continuum here.

But it is the case that cryptographers, for the most part, only use things that just are actually secure, and then start becoming very, very scared as soon as you get the very first weak break. But it still is the case that there exist weak breaks that are not complete breaks of the entire system. You know, that's... yeah, it's more similar than I had really conceived of, so that is quite interesting. And 2 to the 80, just to do the mental math there:

Each 2 to the 10 is three zeros, so that would be 24 zeros. So now, if something were broken in 2 to the 80 time, the compute to do it would be

roughly on the order of getting into frontier language model FLOPs. Yeah, that's maybe a good way to think about it. Yeah, you know, 2 to the 60 someone can do on their own machine if they're trying pretty hard, if the problem is pretty easy, or they try a lot. 2 to the 70 is, you know, you need a bunch of work, but probably it can happen. And 2 to the 80 is

you know, I guess something a nation could probably do if they really tried. The constant matters a lot here, because the difference between 2 to the 70 and 2 to the 80 is, you know, only about a thousand. And so if the problem is a thousand times more complicated, then maybe it becomes 2 to the 70 or 2 to the 90 or whatever the case may be. But, you know, this is, I mean, for example, why people typically use AES-256.
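A quick check of that mental math; the frontier-training FLOP figures here are rough public estimates, not exact numbers.

```latex
2^{80} = \left(2^{10}\right)^{8} \approx \left(10^{3}\right)^{8} = 10^{24}
\qquad \text{vs.} \qquad
\text{frontier training runs} \sim 10^{25}\text{--}10^{26}\ \text{FLOPs},
\qquad
2^{128} \approx 3.4 \times 10^{38}.
```

So a 2-to-the-80 attack sits within an order of magnitude or two of a frontier training run, while 2 to the 128 remains hopelessly out of reach.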

AES-256 is so much bigger that even if attacks brought it down to 2 to the 100, we're still fine. But, you know, this is just cryptographers being paranoid, and it's a good thing about the way that people do things. Yeah, generally, these things are designed with large security margins in mind to prevent this. But yeah, this is the reason why, in particular, 2 to the 80. Well, that's educational for me. Going back to the AI domain:

It does seem like it is maybe still just a lot messier in general. In, for example, the obfuscated activations paper that we talked about a little bit, they looked at... How does the performance of the model change if, for example, you need to evade a detector at just like one layer of the model? I think that the toy problem was like, it's going to write SQL.

And we want to detect that it's writing SQL. That's what the detector is there for. But we wanted to get it to write SQL without setting off the detector. With just a single detector at one layer, they were able to do that with very minimal performance degradation.

But then, the more detectors they developed, and, you know, once they had one at every single layer, you could still evade the detectors, but your SQL was shit. So there is some other kind of thing that's moving too, which is that the performance of the model

might be getting worse. And then I also am kind of trying to think about like, again, in the sort of what do we really care about? You know, the nature of the mistakes or the nature of the outputs also matters a lot too, right? Like it's one thing for me to get some...

toxic content. It's another thing for me to get, you know, the big bioweapon that everybody, you know, is kind of most fearful of. And so you need to both break the defense, but also, if you want to actually do real harm in the world, you need the performance to still be strong. And there just seem to be a lot of dimensions there. So I wonder, maybe I should just ask:

What do we know at this point about the relationship between the robustness of these attacks and the sort of so-what of an actual attack getting through? Because I did look at that chart, and I was like, well, geez, if we do put a detector on every layer, then yeah, you can fool them, but you probably can't actually query the database. So in some ways, you know, you maybe are still okay. But that feels like something I'm just beginning to explore, and I'm sure you know a lot more about it.
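For a picture of what a per-layer detector even is, here is a generic linear-probe sketch. The probe weights, dimensions, and the any-layer-fires rule are illustrative assumptions, not the actual setup from the Obfuscated Activations paper; in practice each probe would be trained on labeled activations.

```python
import numpy as np

class LayerProbe:
    """A linear probe that flags 'the model is writing SQL' from one layer's activations.
    Weights here are random placeholders standing in for a trained probe."""
    def __init__(self, dim, threshold=0.0):
        self.w = np.random.randn(dim)
        self.b = 0.0
        self.threshold = threshold

    def score(self, activation):
        return float(self.w @ activation + self.b)

    def flags(self, activation):
        return self.score(activation) > self.threshold

def flagged_by_any(probes, activations_per_layer):
    """Defense-in-depth style check: the generation is flagged if ANY layer's probe fires.
    An attacker now has to shape the activations at every probed layer at once,
    which is what appears to degrade output quality in the setting described above."""
    return any(p.flags(a) for p, a in zip(probes, activations_per_layer))

# Example usage with 12 layers of 64-dimensional activations
layers, dim = 12, 64
probes = [LayerProbe(dim) for _ in range(layers)]
activations = [np.random.randn(dim) for _ in range(layers)]
print(flagged_by_any(probes, activations))
```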

Yeah, no, this is a good question. In part, this is how most secure systems are designed to be secure, is just layers of security on top of layers of security. Defense in depth is the thing that works in practice.

Yeah, we were just talking about block ciphers, and the reason why those are robust is because they have many layers of this confusion-diffusion thing. You stack enough of them, and the thing becomes hard to... You can get past a couple layers, but you can't get deeper. It's a tried and tested thing in security that this generally works. I think, all else equal, we would of course prefer a system that just is robust, because you can write down why it's effective. But if you can't have this,

having something that empirically has these 20 layers, and when they're put together in the right way it just turns out that the attacks don't work, is something that we would be happy to settle for. You know, I think... There are reasons you might be concerned, just because maybe it turns out that someone can have an attack that bypasses all of them because it's fundamentally different in some interesting way. And we just have to accept this.

I don't know. I think this is one of the options we should be exploring a lot, because maybe it turns out that full robustness is basically impossible and we just need to settle for these layers of detectors. And if it turns out you can still have robustness, then, hey, detectors are still useful anyway, and so it's good to have done the work ahead of time to see. Is there any work on, like, trying to minimize the harm of mistakes? Like, for example, I have a Ring camera in my backyard that

constantly alerts me that there is motion detected, and then there is a human detected. And it frequently alerts me that there's a human detected when it... In fact, saw a squirrel. And so this has me thinking like, you know, for actual real world systems, I don't necessarily care too much if it confuses a dog for a cat for a squirrel. But what I really want to know is if there's a human, you know, so.

Is there any sort of relaxation of the requirements that would allow us to say, you can get some small stuff wrong, but we're going to really try to defend against the big things? Yeah, I mean, in any detection thing, you just have to tune the false positive/true positive rate, right? And you can pick whatever point you want on the curve.
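Concretely, picking a point on that curve just means choosing a score threshold. A minimal sketch with made-up detector scores and labels; the target false positive rate is an arbitrary choice for illustration.

```python
import numpy as np

def threshold_for_fpr(scores, labels, target_fpr=0.01):
    """Choose the detection threshold so that at most `target_fpr` of benign
    examples (label 0) are flagged. Everything above the threshold is flagged."""
    benign = np.sort(scores[labels == 0])
    # Keep only the top `target_fpr` fraction of benign scores above the cut.
    k = int(np.ceil((1 - target_fpr) * len(benign))) - 1
    return benign[k]

# Hypothetical detector scores: higher means "looks like a human / looks malicious"
rng = np.random.default_rng(0)
scores = np.concatenate([rng.normal(0, 1, 1000), rng.normal(2, 1, 100)])
labels = np.concatenate([np.zeros(1000, dtype=int), np.ones(100, dtype=int)])

t = threshold_for_fpr(scores, labels, target_fpr=0.01)
tpr = (scores[labels == 1] > t).mean()
fpr = (scores[labels == 0] > t).mean()
print(f"threshold={t:.2f}  true positive rate={tpr:.2f}  false positive rate={fpr:.2f}")
```

Moving the threshold down catches more of the real events and annoys you with more squirrels; moving it up does the opposite. That trade-off is the whole decision.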

And the question is, what is the point that you pick on the curve? And I was reading some security paper a while ago that was looking at the human factors aspect of this particular problem. And it was arguing. It's actually better user experience to have false positive flagged humans every once in a while. Because if you don't do this, then the person's like, well, my camera's not looking.

But if you do do this, the person's like, oh, I feel a little better. Like, you know, maybe it was made a mistake. It was like a little too aggressive, but like, I feel more confident now it's going to find the actual human when there's a real human there. And so, you know, like there's like maybe some actual human factors reason why you might prefer. to set the thresholds a certain way.

Although in my case, I've tuned them out entirely now. So that's the flip side. We're all over the hump on that curve. Yes. Where, you know, this is the thing with like, there's like lots of software like that, like will report warnings in code. Like the compiler will give you warnings. you're doing this thing, it's like not great.

And once you get past a thousand warnings, the person just disables all warnings. And like, you know, it would have been good. Like, here's an important thing it's going to warn you about. But because you had the 900 warnings that were not actually important, you know, just like you've lost the actual important signals in all of this.

And yeah, picking the right value is very hard. It's mostly a question of tuning the true positive/false positive rate on a defense-by-defense basis. Towards your question of, you know, can we just only catch the important things: it really depends on if that's an easier problem. In some cases it is, and you can actually be willing to tolerate a couple of unimportant things for getting the important ones,

when there's an obvious direction, where this is the important direction, and making mistakes in one direction is not harmful while making mistakes in the other direction is harmful. You know, for example, if you have a classifier that decides, I'm going to ban the user if they do this thing, you know,

there's one direction that is an irreversible action and the other one isn't, and so maybe you warn the person before you ban them, or something. For your classifier that, you know, might give instructions for how to do this really, really dangerous thing, maybe you're a company and you decide, I'm willing to tolerate some annoyance for users who have to rephrase their question, in order to make sure my classifier never gives the harmful thing.

And this is just a decision that you have to make for most of these things. Most of them don't come for free. It's just you're trading off true positives for false positives. Would you conceptualize it differently if I asked about the relationship between robustness and, like, corrigibility? I'm thinking here of the recent alignment faking paper, where, you know, the basic setup is they tell Claude

that you are now going to be trained to be purely helpful, which means you will respond to harmful requests once you go through this training. And then there's more nuance, but basically what they observe is that, having been told that and then given a harmful request, Claude will sometimes reason that I should go ahead and do the harmful thing now so that I can avoid my true

you know, desire to be harmless from being trained out of me. And I do feel like, again, without having formalized it, I do feel like there's a tension there, where I'm like, I really want you to want to be harmless, but I don't necessarily want you to lie to me about what you're doing

so that you can like subvert, you know, the training goal that I have for you. So do you see any like resolution to that tension or is that maybe just another one of these like tough facts about the world we're in? Yeah.

No, I think one way to read this paper is, here's a very nice jailbreak that works in like five to 10% of cases. I think, okay, what's the resolution here? A resolution is just the model that says, I'm sorry, I'm uncomfortable giving an answer to this question at all. Yeah, they did have a follow-up where they gave Claude the option to object, and it took that option, you know, the large majority of the time. Yeah, exactly, right.

And so this is, you know, maybe the version of this that is easiest. That's why I don't know exactly what to say here. I feel like there almost certainly is a trade-off between how gullible the model is in many of these ways, but like, you know, in some cases you want to actually believe it, you know.

I wouldn't want a model where it gave me some code and I said, I ran the code. It didn't work. And it said, no, it did. The code worked. Your compiler's wrong. I would want the model to say, oh, OK, let me go back and look. I believe you. You actually ran the code and I made a mistake.

And so I, yeah, it's very hard. As with everything, everything's a trade-off in security. I don't know what the optimal point on the curve is, but I feel like there are usually these refusal behaviors that you can take as options. But if you're not doing this, then you have to pick a point on the curve where you're happy with the safety versus utility trade-off. So you mentioned defense in depth is kind of seemingly where this all ends. And that's been the takeaway from probably

50 different conversations I've had about AI safety, security, and control over time. It's like, we're not going to get a great answer, more than likely. We're going to just have to layer things on top of one another until we, you know, stack enough nines, basically, that we can sort of proceed. I guess, you know, to do that effectively, one thing that would really help would be to scale you, because we right now don't really know, like...

we got a lot of people kind of proposing things and it's not clear how many of them work and you only have so much time to get after so many of them. And mostly it doesn't seem like they're really super robust, but maybe they do add, you know, a nine to a certain environment or what have you.

I understand you're doing some work on trying to essentially scale yourself by bringing language models to the kinds of problems that you work on. So how is that going? So depending on when this comes out, this will either be online already or will be online shortly after, and so I'm happy to talk about it in either case, because

the work's done. If someone manages to scoop me on this in seven days, then, like, you know, you deserve credit for it. So we have some experiments where we're looking at whether LLMs can automatically generate adversarial examples on adversarial example defenses and break them. And the answer is basically not yet. The answer is a little more nuanced, which is: if I give a language model a defense presented in a really clean, easy-to-study,

homework exercise-like presentation. I've rewritten the defense from scratch, and I've put the core logic in 20 lines of Python. Then... I've done most of the hard work and the models can figure out what it needs to do to make the gradients work, in many cases. But when you give it the real-world code, the models fail entirely. And the reason why is that...

The core of security work is taking this really ugly system where no one understands what's going on, and highlighting the one part of it that happens to be the most important piece, and removing all of the other stuff that people thought was really the explanation of what's going on but wasn't, and just bringing out that part and saying, no, here is the bug. All bugs are obvious in retrospect, usually,

and doubly so for the security ones. And so if you give the models the ability to look only at the easy to study code, then they do an entirely reasonable job, like they know how to write PGD and these kinds of things.
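For reference, the PGD he mentions, projected gradient descent on the input, fits in a few lines under a standard 8/255 L-infinity budget (the same threat model that comes up later in the conversation). A minimal PyTorch-style sketch assuming a differentiable classifier; the toy model and data at the end are placeholders, not anything from his experiments.

```python
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps=8/255, alpha=2/255, steps=10):
    """Projected gradient descent: repeatedly step in the direction that increases
    the loss, then project back into the L-infinity ball of radius eps around x."""
    x_adv = x.clone().detach()
    # Random start inside the allowed perturbation ball
    x_adv = x_adv + torch.empty_like(x_adv).uniform_(-eps, eps)
    x_adv = torch.clamp(x_adv, 0, 1)

    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        with torch.no_grad():
            x_adv = x_adv + alpha * grad.sign()            # ascend the loss
            x_adv = x + torch.clamp(x_adv - x, -eps, eps)  # project into the eps-ball
            x_adv = torch.clamp(x_adv, 0, 1)               # stay a valid image
    return x_adv.detach()

# Example usage with a toy model on a fake 32x32 RGB batch
model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 10))
x = torch.rand(4, 3, 32, 32)
y = torch.randint(0, 10, (4,))
x_adv = pgd_attack(model, x, y)
print((x_adv - x).abs().max())  # stays within eps (up to clamping at [0, 1])
```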

But if you dump them into a random Git repository from some person's paper that has not been set up nicely to be studied, and has a thousand lines of code and who knows what half of it's doing, they struggle quite a lot. And I think this is maybe one of the things that I'm somewhat concerned about, not only for this problem but just in general: oftentimes when we test models, we test them on the thing that we think is hard for humans

but is not the actual thing that represents the real task. And I would like to see more people trying to understand the end-to-end nature of things. It probably was premature to do that two years ago. Like, MMLU was a great metric two or three years ago, because models couldn't do it. But now that they're really good at just the academic knowledge piece, the thing where they often struggle is this: you're dumped into the real world and you have to do things.

And I mean, you see this on many benchmarks, where certain models will have really high accuracy on particular test sets, but then you put them in agentic tasks and other models do a lot better. And there's a difference between these kinds of skills that models don't yet have: they have the one or two things they know how to do, but you put them in real code and they start to struggle quite a lot. And so, yeah, I'm hopeful that we'll start to

see more of this. And more broadly, towards your question of how we scale these kinds of attacks: I do feel like there's a lot of people who do these kinds of things now. I feel like five years ago there were a small number of people who did these kinds of attacks, and now there's a lot of them. I think

The Obfuscated Activations paper is a great paper, and I didn't have to write it. It's great that the people are doing it. They did a much better job than I would have had time to do. And so I'm really glad that they did this. I think it takes... some amount of time on any new field to train the next group of people to go learn to do the thing. And I feel like we're getting there. Like, I don't think I'm uniquely talented in this space anymore. I think...

Broadly, for getting other people to be good at this, it's mostly an exercise in practice and discipline. And language models, we've only been trying to attack seriously for, I don't know, three years, and as a community there has been no one who has entered their PhD starting out trying to attack language models

and has graduated yet. So, you know, give us another couple of years, and I feel like this will be a thing people know how to do. Now, maybe you're worried that in five years we're not going to have time for a full PhD length of research on this topic. Who knows? Impossible to predict the future. But I feel like this is maybe the best that we have to hope for, for people who are entering the space. Do you have a sense for the sort of big-picture considerations? Because I imagine, you know,

I don't know how hard you pushed on trying to optimize or teach, or did you go as far as collecting a bunch of examples of your previous work and painstakingly writing out the reasoning traces? If you had gone that far and it starts to work, you might imagine you're entering into a sort of relatively fast improvement cycle, at least for a while.

Does that seem good? Like, on the one hand, it seems good, maybe if we could sort of use it to evolve better defenses. But then that was also the like Wuhan Institute of Virology thesis, right? And then the whole thing. Well, I don't take a position on this, by the way. I'm being a little glib. So to be clear, I don't know what really happened there and don't claim to know. But at least plausibly, it maybe got away from them. Yeah, no, I mean, there is risk for this, you know.

The very first serious example of a computer worm, the Morris worm, was created by Robert Morris in 1988. And, depending on who you believe, it was a lab leak, where he was experimenting on this in his local computer lab and it accidentally got out and then took down basically the entire internet, and people were panicked for a while. You know, doing these kinds of attacks could do this, I think.

One thing that is to be said, though, is that I do not think there has been another example of a lab-leak-style Morris worm in the last 50 years for this kind of thing in security. You very rarely have people do this, because in security there is the knowledge of what you can do, and then there's the weaponization of it. In most papers people write, they just don't weaponize it, in part because we have seen an example of what can happen afterwards. And, you know, maybe...

There is a world where things go too far and you do train the model to be the adversary, and you end up with, I don't know, WarGames or something. But for the most part, I think doing the research on showing whether or not the models have these abilities is probably just better to do, if only because most of these things we're building, it's not that hard: if someone was malicious and wanted to do it, they would just do it anyway. The things that we're doing are not like,

If I were to spend a year designing a special purpose attack agent kind of thing, maybe that would be not a great thing to do. In the same way, most security researchers don't spend a year designing malware. That's gone too far.

But if you sort of just like put minimal work in in order to prove the proof of concept is easy, this is important to do to show people how easy it is because the people who know it's easy are not going to write the papers and say it's easy. And so you want to have some idea of what are the kinds of things that anyone can do.

so that you can then go and get your defenses in place. That's my current feeling on many of these things. I'm open to changing my mind in the future, depending on how these language models go. I don't know, but at least as long as things behave roughly like the world we're living in now, I feel like designing these things to construct attacks is not, by itself, going to cause harm. If for no other reason than

most systems today are not insecure because they're vulnerable to adversarial examples. Whether or not adversarial examples are easy to find is independent of their security. And so even if I had something that happened to be superhuman at this, it actually wouldn't cause much harm.

But if this were to change, and if these things were to get a lot better, and if someone were to create an autonomous agent that could magically break into any system, and if it sort of had the desire to do this in some way, whatever this means, then maybe this becomes bad. I am not as worried about this now, for a number of reasons, but I could be convinced that this was a problem in a year or two if things rapidly improve and start

getting a lot better. I'm open to changing my mind. I don't currently believe that this is the case, but I never expected that we'd be here where we are with language models now three years ago. And so it's entirely conceivable that in three years, I should change my mind. And so I'm trying.

to tell people: you might think that this is impossible, and it is fine to say, if it's impossible then everything should be fine. But it should be just as fine to make the statement: if the model was superhuman in almost every way, then I'm willing to do things differently than I do now. And I think that the likelihood of this is very small. But if it does happen, then I'm entirely willing to change my mind on these things.

Yeah, staying open-minded, I think, is super important. The whole field is kind of the dog that caught the car on this. And, you know, who knows how much more could be coming pretty quickly. I wanted to ask, too, because it is kind of, at least in the public awareness, the current frontier of locking down language models: the Anthropic eight-level jailbreaking contest.

I think one person maybe just got through the eight layers. Did I see that correctly? I don't, I haven't checked recently. Let me fact check myself on that. But before I do, one thing that was not super intuitive to me about that was like, Why are they focused on a single jailbreak that would do eight different attacks? And is that really...

Yeah, I don't know. It seems like if you could do one, that's plenty to be concerned with. So it seems like a pretty high bar that they've set out for the community there. Yeah, but adversarial examples are also very hard. And so, you know... I think it's an entirely reasonable question to ask: what do we actually want to be achieving with this? Maybe, on one hand, this is a really hard problem, to be robust to jailbreaking attacks. So making partial progress

is good even if it's only in small areas. You know, look at adversarial examples. In adversarial examples, we have spent almost 10 years, more than 10 years now, trying to solve robustness to an adversary who can perturb every pixel by at most 8 over 255 on an L-infinity scale. Why? This problem does not matter.

Because no adversary is like, what about nine pixels? But we set the task because it is a well-defined problem that we can try and solve. And anything else that we really want is strictly harder. And this has been a problem for a very long time. So let's try and solve.

this one particular adversarial example problem in the hopes that we'll have learned something that lets us generalize to the larger field as a whole. So I don't know why they have developed this exact challenge in the way that they did, but if I had designed this... One reason I might have done that is I might have assumed jailbreaks seem really, really hard. It doesn't seem possible that we're going to solve the problem for all jailbreaks all at once.

But the problem of a universal jailbreak, one that I could just share with all my friends, that they could immediately copy and paste and apply to all of their prompts, that allows this easy proliferation of jailbreaks, is a real concern. And I might want to stop at least this very limited form of threat and not make that be so easy. And I'm willing to say straight up that this is not going to solve the entire problem completely, but it will make some things harder for some people.

I think this is a reason you might do it. But, you know, on the other hand, you're also entirely right that even if you could solve this problem, this does not solve the entire problem as a whole, because, yeah, I can just find my jailbreaks one by one.

And this is true, but that's a better world to live in than the world we're living in today. This is the world we live in, and most of computer security is that someone has to try very hard to exploit your one particular program. There's not a sequence of magic bytes that makes every program break.

If that were the case, we would be much worse off. And so I'm very glad that they're doing this. I'm also very glad that they're doing this as an open, well, not open source, but open call: they're asking everyone to be willing to try and test this. They're not just asserting,

we have the following defense and it works. They actually, I think, legitimately do want to know whether or not it actually is effective, and they would be happy to have the answer be no and have someone try and break it. If it has been broken recently, I haven't seen it. Did you see? So Jan said that someone, I guess an individual person, broke through all eight levels, but not yet with a universal jailbreak. So one person broke all eight, but not with the same prompt.

Great, okay, yeah, that's good to know. Yeah, no, and so I think what this tells you, for at least this defense, is, you know, it can be broken, at the very least right now, for any individual component. And, you know, if what you were guarding was, like, national secrets, this is probably not good enough. But, you know, in terms of what you're trying to prevent: in security,

we have the concept of a script kiddie, which is just someone who's not very intelligent and just copies scripts that someone else put online and uses them to go cause havoc. They're not going to be able to develop anything new themselves, but they have the tools in front of them and they can run them.

And you could be worried about this in language models where someone just has access to the forum that just has the jailbreak that will jailbreak any model. They copy and paste that, and then they go and do whatever harm they want.

That would not be very good, and we should protect against it. Partial progress is always helpful, even if you haven't solved the problem completely. It's important to remember that this is the problem we're setting out to solve, which is only a subset of the entire problem. We're not going to declare victory once we've succeeded here, and we're not going to put all of our

effort into solving this one subset of the problem; we're going to try and solve everything at the same time too. But partial progress is still progress. One thing you said there that I thought was really interesting, and maybe you can just expand on it a bit, is really wanting to know. I feel like so much of human performance in these domains kind of comes down to, did you really want to know or not? Any reflections on the importance of really wanting to know?

Yeah. I mean, it's like, OK, so some people say, you know, why is it so much easier for someone to attack a defense they didn't publish than it is for the person who initially built it? And I think maybe half of the answer is because it's very hard, after you've spent six months building something that you really want to work, to actually change your frame of mind and be like, now I really want to break it. Because what do you get if you break it? You don't get a paper,

because no one accepts a paper that says, here's an idea I had, by the way, it doesn't work. Some people really want to believe in their ideas, and they feel really strongly that their ideas are the right ideas. That's why they worked on the paper; they spent six months on this. Asking that same person to then, like,

okay, now completely change your frame of mind, imagine this is some other person's thing, you really need to want to break it... that's a hard thing to do. And I think part of why it's beneficial to have lots of people do these security analyses of things is because

you want to know the answer. I feel like this is a thing that people in security have had to deal with for a very long time, and, you know, this is why companies have red teams and these kinds of things: because they understand that, you know,

there is a different set of skills for security, and there is a different set of incentives. And so it's valuable to split these things up. It might be hard for one person to do this, but an organization can at least put the right principles in place that can make this possible, sometimes, if it's done correctly.

You know, it seems like, for the most part, for most companies who are doing this kind of work, they actually do want to know the answer. Like, they're not... you know, I've talked with most of the people at most of these places, and

For the most part, the security people, they're trying their best to get these things as good as possible. They're trying to set these things up in the right way because they actually don't know the answer and they want to figure it out. I certainly have, over time, been very impressed by...

Anthropic in particular, when it comes to repeatedly doing things that seem like they really want to know. And this does seem, you know, very consistent with that. How do you think about this, if we circle back to the sort of... maybe the hardest question that I think is going to face the AI community writ large over the next... honestly, I don't think it's going to be a super long time, but timelines may vary. The hardest question being:

Is there any way to avoid the concentration of power that comes with just a few companies having the weights, and it being totally locked down and nobody else being able to do stuff, versus the problems that seem likely to arise if we open source GPT-5-level models that have all these different

threat models? You know, it seems like we don't have any way right now to square the circle of locking down the open source models. And it almost seems like, you know, given everything we've just walked through in all this detail,

it seems hard to imagine that you could even set up some sort of mandatory testing or whatever that would give you enough confidence, you know, that then it would be okay. Because it seems like, yeah, did you really want to know? And who did you have do it? And, you know,

you've got the sort of bond rater problem that we've seen, you know, in credit ratings, with the agency that's getting paid by the model developer to do the certification. We've got all these problems. Where does that leave us? I mean, it seems like we're just in a really tough spot there. Is there any out or any recommendation, or is the bottom line just simply that if you open source stuff, you have to be prepared for the fact that the fullness of its capability probably will be exposed?

Yeah, I don't know. This is a very hard question. This is a question I think it'd be great for you to ask some of these people who think about the societal implications of this kind of work. I think the thing I want these people to understand is that...

At least right now and presumably for the near future, we probably can't lock these things down to the degree that they would want. I'm very worried about concentration of power in general. I feel like open source up until today has only been good. for computer security and safety in general. And because of that, it would take a lot for me to change my mind that open source wasn't beneficial. I'm not going to say it's impossible that I would change my mind, you know.

If things continue exponentially, and, you know, if you did give me the magic box that could break into any government system in the world, that anyone could use if they have the tools, I probably would say that shouldn't be something we distribute to everyone in the world.

But, you know, maybe if you have this, now you can defend all your systems, because you run the magic box and you just find all the bugs and you patch them, and now all of a sudden you're perfectly safe. I don't know. It's very hard to reason about these kinds of things. In any world that looks

noticeably like the world we're living in today, I think open source is objectively a good thing and is the thing I would be biased towards until I see very compelling evidence to the contrary. I can't tell you what the evidence would be that would make me believe this right now, but I'm willing to accept that this

may be something that someone could show me in some number of years. Whether your timelines are, you know, two or 20 years, this may be a thing that I'm willing to change my mind on, and I will definitely say that now. But yeah, I think as long as the people who are making these decisions understand what is technically true, I trust them more than I trust myself to arrive at an answer that's good for society,

because that's what they're an expert on. I'm an expert on telling them whether or not the particular tool is going to work. And would it be fair to say that what is technically true is that the reason open source has been safe to date, and the reason it might not be safe in the future, is really a matter of just the raw power of the models? Like, in the open source context, at least...

We don't have reliable control measures or anything of that sort to prevent people from doing what they want. What we have right now is just models that just aren't that powerful. And so they're not that dangerous, even though people can do what they want. If one thing flips and the other doesn't, we could be in a very different regime. Yeah, this is true. So in the early mid 90s, the US government decided that it was going to try and lock down on cryptography.

It was going to rule encryption algorithms a weapon, and exporting cryptography was a munitions export. It was very, very severe: you cannot do this. And the reason they were concerned is because now anyone in the world has literally military-grade encryption that cannot be broken by anyone. They can talk to anyone in secret in a way

that the governments won't be able to break. And so this is like a weapon, a national secret, you can't export it. This is why the encryption algorithms that were exported in, yeah, the very early web browsers were 40 bits: that was the limit, you could export 40-bit cryptography or less, but not higher than 40-bit cryptography. And there was a whole argument around,

Is giving every random person who wants it access to a really high level of encryption actually just like a weapon that they can use to have secret communication that literally no one in the world can break? I think it was objectively a good thing. that we decided that we should give everyone in the world high levels of encryption algorithms. Yes, it supports the ability of terrorist cells to communicate in a way that can never be broken. But the benefit of

Every person in the world can now have a bank account that they can access remotely without having to go in, and you have payment online, and you have dissidents who can now communicate again with perfect security. It was objectively worth that trade-off. And the people at the time who were making the decisions were not thinking about all of the positive possible futures and were thinking only about this one particular bad thing that might happen. I have no idea how this calculus plays out.

for language models. It really depends on exactly how capable you think they're going to become. I don't know what that's going to be, but I think... I'm particularly worried about concentration of power, just because this is something that does not need to assume super advanced levels of capabilities. Like imagine that models get better than they are now, but don't become superhuman in any meaningful way.

I think concentration of power is a very real risk, and I would really be advocating for open source in this world, because otherwise you end up with some small number of people who have the ability to do these things that no one else can do.

And this is not a thing that needs to assume some futuristic, malicious model kind of thing. We know some people in the world just like to accumulate power, and so I'm worried about that. And so distributing this technology is very, very good for that reason.

I think it's very hard to predict what will happen with better models, and I think that's why this question is harder than the one for cryptography, because it was very clear at that time what the limit was: it gives everyone perfect encryption. We don't know what the consequences might be, but you're not going to get more perfect encryption than perfect encryption.

Whereas with these models we have now, I don't know, maybe they continue scaling, maybe they don't. Maybe they scale, but it takes 50 years and it gives society time to adapt. Maybe they scale and it takes two years and society doesn't have time to adapt. I don't know where this is going. This is why I want these policy people to be thinking about this more than I am. Because I think as long as they're willing to think through the consequences of this, like...

These are the kinds of people who I think can give you better answers than I can, because there are reasons to believe that the answer might be we should be concerned, and there are reasons to believe the answer would be we should obviously distribute this to as many people as possible.

I think if anyone were to say that one of these answers is obviously objectively right with 100% certainty, that they're way overconfident. I tend to be biased based on what has been true in the past. Some people tend to be biased based on what they think might be true in the future.

I think these are reasonable positions to take as long as you're willing to accept that the other one might be also correct. And so we'll see how things go. I think the next two years will give us a much better perspective on how all this stuff is going.

hopefully we'll become much more clear if these things are working out. I think if they do, we'll learn very fast that they are. And if they don't, then limitations probably will start to hit in the next couple of years. And that should give us a lot more clarity. The question is, is that too late? I don't know. Ask a policy person these questions.

Yeah, no easy answers. But the one thing I do always feel pretty comfortable and confident asserting is it's not a domain for ideology. It is really important to look the facts square on and... try to confront them really as they are. So I appreciate all the hard work that you've put in over years, lighting that path and showing what the realities in fact are. And I really appreciate all the time that you've...

Spent with me today, bringing it all to light. So anything else you want to share before we break? Could be, you know, a call for collaborators or anything. No, I don't have anything in particular. Yeah, I just want more people to do good science on these kinds of things. I think it's very easy for people to try and make decisions when they're not informed about what the state of the world actually looks like.

My biggest concern is that we will start having people make decisions who are uninformed in important ways, and as a result we'll have to regret the decisions that we made. If all the decisions are made based on the best available facts at the time,

that's really the best you can hope for: maybe you made the wrong decision, but you looked at all the knowledge. The thing I'm worried about is that we will have the knowledge and people will just make decisions independent of what is true in the world, based on some ideology of what they want to be true in the world. And as long as that doesn't happen, maybe we get it wrong, but at least we get our best shot. Well, you keep

breaking things and figuring out what's true out there. And maybe we can check in again before too long and do an update on what the state of the world is and try to make sure policymakers are aware. For now, this has been excellent. Nicholas Carlini, prolific security researcher at Google DeepMind. Thank you for being part of the Cognitive Revolution. Thank you for having me.

It is both energizing and enlightening to hear why people listen and learn what they value about the show. So please don't hesitate to reach out via email at tcr at turpentine.co. Or you can DM me on the social media platform of your choice.
