
Nicholas Carlini (Google DeepMind)

Jan 25, 2025 · 1 hr 21 min

Episode description

Nicholas Carlini from Google DeepMind offers his view of AI security, emergent LLM capabilities, and his groundbreaking model-stealing research. He reveals how LLMs can unexpectedly excel at tasks like chess and discusses the security pitfalls of LLM-generated code.


SPONSOR MESSAGES:

***

CentML offers competitive pricing for GenAI model deployment, with flexible options to suit a wide range of models, from small to large-scale deployments.

https://centml.ai/pricing/


Tufa AI Labs is a brand new research lab in Zurich started by Benjamin Crouzier focussed on o-series style reasoning and AGI. Are you interested in working on reasoning, or getting involved in their events?


Go to https://tufalabs.ai/

***


Transcript: https://www.dropbox.com/scl/fi/lat7sfyd4k3g5k9crjpbf/CARLINI.pdf?rlkey=b7kcqbvau17uw6rksbr8ccd8v&dl=0


TOC:

1. ML Security Fundamentals

[00:00:00] 1.1 ML Model Reasoning and Security Fundamentals

[00:03:04] 1.2 ML Security Vulnerabilities and System Design

[00:08:22] 1.3 LLM Chess Capabilities and Emergent Behavior

[00:13:20] 1.4 Model Training, RLHF, and Calibration Effects


2. Model Evaluation and Research Methods

[00:19:40] 2.1 Model Reasoning and Evaluation Metrics

[00:24:37] 2.2 Security Research Philosophy and Methodology

[00:27:50] 2.3 Security Disclosure Norms and Community Differences


3. LLM Applications and Best Practices

[00:44:29] 3.1 Practical LLM Applications and Productivity Gains

[00:49:51] 3.2 Effective LLM Usage and Prompting Strategies

[00:53:03] 3.3 Security Vulnerabilities in LLM-Generated Code


4. Advanced LLM Research and Architecture

[00:59:13] 4.1 LLM Code Generation Performance and O(1) Labs Experience

[01:03:31] 4.2 Adaptation Patterns and Benchmarking Challenges

[01:10:10] 4.3 Model Stealing Research and Production LLM Architecture Extraction


REFS:

[00:01:15] Nicholas Carlini’s personal website & research profile (Google DeepMind, ML security) - https://nicholas.carlini.com/


[00:01:50] CentML AI compute platform for language model workloads - https://centml.ai/


[00:04:30] Seminal paper on neural network robustness against adversarial examples (Carlini & Wagner, 2016) - https://arxiv.org/abs/1608.04644


[00:05:20] Computer Fraud and Abuse Act (CFAA) – primary U.S. federal law on computer hacking liability - https://www.justice.gov/jm/jm-9-48000-computer-fraud


[00:08:30] Blog post: Emergent chess capabilities in GPT-3.5-turbo-instruct (Nicholas Carlini, Sept 2023) - https://nicholas.carlini.com/writing/2023/chess-llm.html


[00:16:10] Paper: “Self-Play Preference Optimization for Language Model Alignment” (Yue Wu et al., 2024) - https://arxiv.org/abs/2405.00675


[00:18:00] GPT-4 Technical Report: development, capabilities, and calibration analysis - https://arxiv.org/abs/2303.08774


[00:22:40] Historical shift from descriptive to algebraic chess notation (FIDE) - https://en.wikipedia.org/wiki/Descriptive_notation


[00:23:55] Analysis of distribution shift in ML (Hendrycks et al.) - https://arxiv.org/abs/2006.16241


[00:27:40] Nicholas Carlini’s essay “Why I Attack” (June 2024) – motivations for security research - https://nicholas.carlini.com/writing/2024/why-i-attack.html


[00:34:05] Google Project Zero’s 90-day vulnerability disclosure policy - https://googleprojectzero.blogspot.com/p/vulnerability-disclosure-policy.html


[00:51:15] Evolution of Google search syntax & user behavior (Daniel M. Russell) - https://www.amazon.com/Joy-Search-Google-Master-Information/dp/0262042878


[01:04:05] Rust’s ownership & borrowing system for memory safety - https://doc.rust-lang.org/book/ch04-00-understanding-ownership.html


[01:10:05] Paper: “Stealing Part of a Production Language Model” (Carlini et al., March 2024) – extraction attacks on ChatGPT, PaLM-2 - https://arxiv.org/abs/2403.06634


[01:10:55] First model stealing paper (Tramèr et al., 2016) – attacking ML APIs via prediction - https://arxiv.org/abs/1609.02943

Transcript

The fact that it can make valid moves almost always means that it must, in some sense, have something internally that is accurately modeling the world. I don't like to ascribe intentionality or these kinds of things, but it's doing something that allows it to make these moves knowing what the current board state is and understanding what it's supposed to be doing.

Everyone means something different by reasoning. And so the answer to the question, is that reasoning, is entirely what you define as reasoning. And so you find some people who are very much in the world of, I don't think models are smart, I don't think that they're good, they can't solve my problems. And so they say, no, it's not reasoning, because to me, reasoning means, and then they give a definition which excludes language models.

And then you ask someone who's very much on the AGI side, you know, language models are going to solve everything. By 2027, they're going to have displaced all human jobs. You ask them, what is reasoning? And they say reasoning is... Hi, so I'm Nicholas Carlini. I'm a research scientist at Google DeepMind. And I like to try and make models do bad things.

and understand the security implications of the attacks that we can get on these models. I really enjoy breaking things and have been doing this for a long time. But I'm just very worried that because they're impressive, we're going to have them applied in all kinds of areas where they ought not to be. And that, as a result, the attacks that we have on these things are going to end up with bad security consequences.

MLST is sponsored by CentML, which is the compute platform specifically optimized for AI workloads. They support all of the latest open source language models out of the box, like Llama, for example. You can just choose the pricing points, choose the model that you want. It spins up, it's elastic autoscale. You can pay on consumption, essentially, or you can have a model which is always working or it can be freeze dried when you're not using it. So what are you waiting for? Go to centml.ai and sign up now.

Tufa Labs is a new AI research lab I'm starting in Zurich. It is funded from past ventures involving AI as well. We are hiring both chief scientists and deep learning engineers and researchers. And so we are a Swiss version of DeepSeek: a small group of people, very, very motivated, very hardworking. And we try to do some AI research, starting with LLMs and o1-style models. We want to investigate, reverse engineer,

and explore the techniques ourselves. Nicholas Carlini, welcome to MLST. Thank you. Folks at home, Nicholas won't need any introduction whatsoever. Definitely by far the most famous security researcher in ML and working at Google and it's so amazing to have you here for the second time. Yeah, the first time, yeah, it was a nice pandemic one, but no, it was great.

Yes, MLST is one of the few projects that survived the pandemic, which is pretty cool. But why don't we kick off then? So do you think we'll ever converge to a state in the future where our systems are insecure and we're just going to learn to live with it? I mean, that's what we do right now, right? In normal security. There is no perfect security for anything. If someone really wanted you to have something bad happen on your computer, they would win.

There's very little you could do to stop that. We just rely on the fact that probably the government does not want you in particular to have something bad happen. If they decided that, I'm sure that they have something that they could do that they would succeed on. What we can get into a world of is the average person probably can't succeed in most cases. This is not where we are with machine learning yet.

With machine learning, the average person can succeed almost always. So I don't think our objective should be perfection in some sense. But we need to get to somewhere where it's at least the case that a random person off the street can't just really, really easily run some off-the-shelf GitHub code that makes it so that some model does arbitrary bad things in arbitrary settings. Now, I think getting there is going to be very, very hard. We've tried, especially in vision.

for the last 10 years or something to get models that are robust, and we've made progress, we've learned a lot. But if you look at, like, the objective metrics, they have not gone up by very much in, like, the last four or five years at all. And this makes it seem somewhat unlikely that we're going to get perfect robustness here in the foreseeable future. But at least we can still hope that we can do research and make things better, and eventually we'll get there.

And I think we will, but it just is going to take a lot of work. So Ilya asked me to ask you this question. Do you ever think in the future that it'll become illegal to hack ML systems? I have no idea. I mean, it's very hard to predict these kinds of things. It's very hard to know, is it already? Especially in the United States, the Computer Fraud and Abuse Act covers who knows what in whatever settings.

I don't know. I think this is a question for the policy and lawyer people. And my view on policy and law is, as long as people are making these decisions coming from a place of what is true in the world, they can make their decisions. The only thing that I try and make comments on here is, like, let's make sure that at least we're making decisions based on what is true, and not decisions based on what we think the world should look like. And so, you know, if they base their decisions around the fact that

we can attack these models and various bad things could happen, then they're more expert at this than me, and they can decide what they should do. But yeah, I don't know. But in the context of ML security, I mean, a really open-ended question just to start with. Sure. Can you predict the future? What's going to happen? The future for ML security.

Okay, let me give you a guess. I think the probability of this happening is very small, but it's sort of like the median prediction, I think, in some sense. I think models will remain vulnerable to fairly simple attacks for a very long time. And we will have to find ways of building systems so that we can rely on an unreliable model and still have a system that remains secure. And what this probably means...

is we need to figure out a way to design the rest of the world, the thing that operates around the model so that if it decides that it's going to just randomly classify something completely incorrectly, even if just for random chance alone, the system is not going to go and perform a terribly misguided action and that you can correct for this, but that we're going to have to live with a world where the models remain very vulnerable for, yeah, I don't know.

for the foreseeable future, at least as far as I can see. And, you know, especially in machine learning time, five years is an eternity. I have no idea what's going to happen with, you know, what the world will look like: if there's machine learning, language models, who knows, something else might happen. Language models have only had, you know, seven years of, like, real significant progress. So predicting five years out is, like, almost doubling that. So I don't know how the world will look there. But at least as long as we're in the current paradigm, it looks like we're in this world where things are

fairly vulnerable. But then again, language models are only, you know, seven years in, and we've only been trying to attack them for really two or three. So give it five years; that's twice as long as we've been trying to attack these language models. Maybe we just figure everything out. Maybe language models are fundamentally different and things aren't this way. But my prior just tends to come from the case of vision models, which we've been trying to study for 10 years, and at least there things have proven very hard. And so

My expectation is things will be hard. And so I'll have to just rely on building systems that end up working. Amazing. So I've been reading your blog and everyone should read his blog because it's really amazing.

And actually, when you first put out this article about chess playing, I've cited it on the show about 10 times. So it's really, really interesting. But let me read a bit. By the way, it's called Playing Chess with Large Language Models. So you said, until this week, in order to be good at chess, a machine learning model had to be explicitly designed to play games, had to be told explicitly that there was an 8x8 board and that there were different pieces and how each of them moved and what the goal of the game was. And, you know, it had to be trained with reinforcement learning against itself. And then it would win.

And you said that this all changed at the time, on Monday, when OpenAI released GPT-3.5 Turbo Instruct. Can you tell me about that? So, what GPT-3.5 Turbo Instruct did, and later other people have done it with open-source models, so you can verify they're not doing something weird behind the scenes. Because I think some people speculate, well, maybe they're just cheating in various ways. But there are open-source models that replicate this now. What you have is a language model that

can play chess to a fairly high degree. And yeah, okay, so when you first tell someone, I have a machine learning system that can play chess, the immediate reaction you get is, like, why should I care? You know, we had Deep Blue, whatever, 30 years ago

that could, like, beat the best humans, and, like, isn't that some form of, you know, a little bit of AI at the time? Like, why should I be at all surprised by the fact that I have some system like this that can play chess? And yeah, so the fundamental difference here, I think, is very interesting: the model was trained on a sequence of moves. So in chess, you represent moves, you know:

1. e4 means, you know, move the king's pawn to e4. And then you have, you know, e5, black responds. And then 2., whatever, Nf3, white plays the knight. You train on these sequences of moves. And then you just say, 6., language model, do your prediction task. It's just a language model; it is being trained to predict the next token. And it can play a move that not only is valid, but also is very high quality.
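As a minimal sketch of the kind of prompt being described, assuming access to the OpenAI completions endpoint and the gpt-3.5-turbo-instruct model identifier, the input is nothing more than a bare move list:

    # Minimal sketch: feed a completion model a bare PGN move list and let it
    # continue. No rules, no board diagram, just moves and the next move number.
    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    pgn_prefix = "1. e4 e5 2. Nf3 Nc6 3. Bb5 a6 4. Ba4 Nf6 5. O-O Be7 6."

    resp = client.completions.create(
        model="gpt-3.5-turbo-instruct",
        prompt=pgn_prefix,
        max_tokens=5,
        temperature=0,
    )
    print(resp.choices[0].text.strip())  # typically a legal, sensible move such as Re1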

This is interesting because it means that the model can play moves that accurately... Let's just talk about the valid part in the first place. Valid is interesting in and of itself. Because what is a valid chess move is a complicated program to write. It's not an easy thing to do to describe what moves are valid in what situations. You can't just be

dumping out random characters and stumbling upon valid moves. And you have this model that makes valid moves every time. And so I don't like talking a lot about, you know, what's the model doing internally, because I don't think that's all that helpful. I think, you know, just look at the input-output behavior of the system as the way to understand these things. But the fact that it can make valid moves almost always means that it must in some sense have something internally

that is accurately modeling the world. I don't like to ascribe intentionality or these kinds of things. But it's doing something that allows it to make these moves knowing what the current board state is and understanding what it's supposed to be doing. And this by itself, I think, is interesting. And then not only can it do this, it can actually play high-quality moves.

And so I think, you know, taken together, it in some sense tells me that the model has a relatively good understanding of what the actual position looks like. Because, you know, okay, so I play chess at a modest level, like I'm not terrible, I understand, you know, more or less what I should be doing. But if you just gave me a sequence of 40 moves in a row, and then said, you know,

41., like, what's the next move? I could not reconstruct in my mind what the board looked like at that point in time. Somehow the model has figured out a way to do this, having never been told anything about the rules, or even that they exist at all. It's sort of reconstructed all of that, and it can put the pieces on the board correctly, in whatever way it does that internally, who knows how that happens, and then it can play the valid move.

It's sort of very interesting that this is something you can do. For me, it changed the way that I think about what models can and can't do with surface-level statistics versus deeper statistics about what's actually going on. I don't know. This is, I guess, mainly why I think this is an interesting thing about the world. Yeah, we have this weird form of human chauvinism around the abstractness of our understanding.

And these artifacts have a surface level of understanding, but it's at such a great scale that at some point it becomes a weird distinction without a difference. But you said something very interesting in the article. You said that the model was not playing to win, right? And you were talking about, and I've said this on the show, that the models are a reflection of you.

So you play like a good chess player and it responds like a good chess player. And it's like that whether you're doing coding or anything else. And it might even explain some of the differential experiences people have, because you go on LinkedIn and those guys over there clearly aren't getting very good responses out of LLMs. But then folks like yourself, you're using LLMs and you're at the sort of galaxy brain level where you're sort of pushing the frontier, and people don't even know you're using LLMs. So there's a very differential experience. Yeah.

Yeah, okay, so let me explain what I meant when I said that. So if you take a given chess board, you can find multiple ways of reaching it. You know, you could take a board that happened because of a normal game between two chess grandmasters, and you can find a sequence of absurd moves that no one would ever play that actually brings you to that board state. You know, so what you do is, like, piece by piece, you say, well, the knight is on g3.

So what I'm going to do is I'm just going to first move the white knight just whatever random spot and put it on g3. Okay, and now I know the bishop is on, you know, whatever, you know, h2. And I'll find a way of moving the pawn out of the way and then putting the bishop on h2. And you can come up with a sequence of absurd moves that ends up in the correct board state. And then you can ask the model, now play a move. Okay, and then what happens? The model plays a valid move. Still, most of the time it knows what the board state looks like. But the move that it plays...

is very, very bizarre. It's like a very weird move. Why? Because what has the model been trained to do? The model was never told to win the game of chess. The model was told: make things that are like what you saw before. It saw a sequence of moves that looked like two people who were rated, like, you know, negative 50, playing a game of chess, and it's like, well, okay, I guess the game is to just make valid moves and see what happens. And they're very good at doing this. And you can do this in the synthetic way.

And also what you can do is you can just find some explicit cases where you can get models to make terrible move decisions just because that's what people do commonly when they're playing. And, you know, most people fall for this trap, and I was trained to play like whatever the training data looked like, and so I guess I ought to fall for this trap. And, you know, this is one of the problems with these models: they're not initially trained to play to win.
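A toy version of this experiment can be scripted with the python-chess library (an assumption here, not something named in the episode): two different move histories that reach the identical position, which you can then feed to a next-move predictor as two different prompts. This uses a mild transposition rather than a truly absurd sequence, but the point is the same: the model conditions on the history, not just the resulting board.

    # Sketch: two move orders that reach the same position. A model that imitates
    # its training data may continue them differently, because it is matching the
    # kind of game it appears to be in rather than evaluating the board to win.
    import chess

    def position_after(moves_san):
        board = chess.Board()
        for mv in moves_san:
            board.push_san(mv)
        return board.board_fen()  # piece placement only

    normal_order = ["e4", "e5", "Nf3", "Nc6"]
    odd_order    = ["Nf3", "Nc6", "e4", "e5"]  # same squares, different history

    assert position_after(normal_order) == position_after(odd_order)
    print("Same position, two different prompts for the model.")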

Now, as far as how this applies to actual language models that we use, we almost always post-train the models with RLHF and SFT instruction fine tuning things. And a big part of why we do that is so that we...

don't have to deal with this mismatch between what the model was initially trained on and what we actually want to use it for. And, you know, this is why GPT-3 is exceptionally hard to use, and why the sequence of instruct papers was very important: it takes the capabilities that the model has somewhere behind the scenes and makes them much easier to reproduce. And so when you're using a bunch of the chat models today, most of the time you don't have to worry nearly as much about

exactly how you frame the question, because of this, you know, they were designed to give you the right answer even when you ask the silly question. But I think they still do have some of this. But I think it's maybe less than if you just have the raw base model that was trained on whatever data it happened to be trained on. Yeah, I'd love to do a tiny digression on RLHF, because I was speaking with Max from Cohere yesterday.

They've done some amazing research talking all about, you know, how this preference steering works. And they say that, like, humans are actually really bad at kind of distinguishing a good thing from another thing, you know. So we like confidence. We like verbosity. We like complexity. And for example, I really hate the ChatGPT model because of the style. I can't stand the style. So even though it's right, I think it's wrong, you know. So when we do that kind of post-training on the language models, how does that affect the competence?

I don't know. Yeah, I mean, I feel like it's very hard to answer some of these questions, because oftentimes you don't have access to the models before they've been post-trained. You can look at the numbers from the papers. So in the GPT-4 technical report, one of these reports, they have some numbers that show that the model before it's been post-trained, so just the raw base model, is very well calibrated. And what this means is when

it gives an answer with some probability, it's right about that probability of the time. So if you give it, like, a math question and it says the answer is five and the token probability is 30%, it's right about 30% of the time. But then when you do the post-training process, the calibration gets all messed up, and it doesn't have this behavior anymore.
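As an aside, a minimal sketch of what well calibrated means here, using hypothetical (probability, correct) pairs rather than real model outputs:

    # Sketch: bucket answers by the model's stated token probability and compare
    # against empirical accuracy. For a calibrated model, answers given with
    # ~30% probability are right about 30% of the time.
    from collections import defaultdict

    def calibration_report(predictions, n_bins=10):
        # predictions: iterable of (probability, was_correct) pairs
        bins = defaultdict(list)
        for prob, correct in predictions:
            bins[min(int(prob * n_bins), n_bins - 1)].append(correct)
        for b in sorted(bins):
            outcomes = bins[b]
            print(f"confidence {b / n_bins:.1f}-{(b + 1) / n_bins:.1f}: "
                  f"accuracy {sum(outcomes) / len(outcomes):.2f} over {len(outcomes)} answers")

    # Hypothetical data: well calibrated around 0.35, overconfident around 0.85.
    data = [(0.35, True)] * 3 + [(0.35, False)] * 7 + [(0.85, True)] * 6 + [(0.85, False)] * 4
    calibration_report(data)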

So, like, some things change. You know, you can often have models that just get fantastically better when you do post-training, because now they follow instructions much better. You haven't really taught them that much new, but it looks much smarter. Yeah, I think this is all a very confusing thing. I don't have a good understanding of how all of these things fit together. I mean, given, you know, that these models make valid moves, they appear to be competent, but sometimes they have these catastrophic, weird failure modes. Yes. So do we call that process reasoning or not?

I'm very big on not ascribing intentionality, or... I don't want to... Everyone means something different by reasoning. And so the answer to the question, is that reasoning, is entirely

what you define as reasoning. And so you find some people who are very much in the world of: I don't think models are smart, I don't think that they're good, they can't solve my problems. And so they say, no, it's not reasoning, because to me, reasoning means, and then they give a definition which excludes language models. And then you ask someone who's very much on the AGI side, you know, language models are going to solve everything, by 2027 they're going to have displaced all human jobs. You ask them, what is reasoning? And they say reasoning is...

And whatever the process is that the model is doing, and then they tell you, yes, they're reasoning. And so I think, you know, it's very hard to talk about whether it's actually reasoning or not. I think the thing that we can talk about is, like, what is the input-output behavior? And, you know, does the model do the thing that answers the question, solves the task, and was challenging in some way? And did it get it right? And then we can

go from there. And I think this is an easier way to try and answer these questions than to ascribe intentionality to something. Like, I don't know, it's just really hard to have these debates with people when you start off without having the same definitions. I know, I'm really torn on this, because, as you say, the deflationary

methodology is: it's an input-output mapping. You could go one step up, so Bengio said that reasoning is basically knowledge plus inference, you know, in some probabilistic sense, and I think it's about knowledge acquisition or the recombination of knowledge. And then it's the same thing with agency, right? You know, the simplistic form is that it's just, like, an automaton: you have, like, an environment, and you have some computation, and you have an action space, and it's just this thing, you know. But

It feels necessary to me to have things like autonomy and emergence and intentionality in the definition. But you could just argue, well, why are you saying all of these words? Like, if it does the thing, then it does the thing. Yeah, and this is sort of how I feel. I mean, I think it's very interesting to consider this, like, is it reasoning? If you have a background in philosophy and that's what you're going for.

I don't have that. So I don't feel like I have any qualification to tell you whether or not the model is reasoning. I feel like the thing that I can do is say, here is how you're using the model. You want it to perform this behavior. Let's just check, like, did it perform the behavior? Yes or no. And if it turns out that it's doing the right thing in all of the cases, I don't know that I care too much about whether or not

the model reasoned its way there or it used a lookup table. Like, if it's giving me the right answer every time, then... I don't know, I tend to not focus too much on how it got there. Although we have this entrenched sense that we value parsimony and robustness, you know; for example, in this chess notation.

If you changed the syntax of the notation, it probably would break, right? Yes. There are multiple chess notations. And I have tried this. So before there was the current notation we use, in old chess books, notation was like, you know, king's bishop moves to queen's three, whatever. You just number the squares differently.
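To make the contrast concrete, here is the same opening written both ways (the descriptive rendering follows the convention of old chess books):

    # The same Ruy Lopez moves in the two notations being discussed.
    algebraic   = "1. e4 e5  2. Nf3 Nc6  3. Bb5"          # what modern game records look like
    descriptive = "1. P-K4 P-K4  2. N-KB3 N-QB3  3. B-N5" # what old books look like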

If you ask a model in this notation, it has no idea what's happening. And it will write something that looks surface level like a sequence of moves, but has nothing to do with the correct board state. And of course, yeah, a human would not do this if you ask them to produce the sequence of moves. It would take me a long time to remember which squares, which things, how to write these things down. I would have to think harder.

I understand what the board is, and I can get that correct. And the model doesn't do that right now. And so maybe this is your definition of reasoning, and you say the reasoning doesn't happen. But someone else could have said, why should you expect the model to generalize this thing that it's never seen before? It's interesting to me. We've gone from a world where we wrote papers.

about the fact that if you trained a model on ImageNet, then, like, well, obviously it's going to have this failure mode that when you corrupt the images, the accuracy goes down. It's like, suppose I wrote a paper seven years ago: I trained my model on ImageNet and I tested it on CIFAR-10, and it didn't work, isn't this model so bad? People would laugh at you. Like, well, of course: you trained it on ImageNet, one distribution, and you tested it on a different one. You never asked it to generalize.

And it didn't do it. Like, good job. Like, of course it didn't solve the problem. But today, what do we do with language models? We train them on one distribution, we test them on a different distribution that they weren't trained on sometimes, and then we laugh at the model: isn't it so dumb? It's like, well, yes, you didn't train on the thing. You know, maybe some future model could just magically generalize across domains. But we're still using machine learning: you need to train it on the kind of data that you want to test it on, and then the thing will behave much better than if you don't do that.

So in an email correspondence with me, you said something, you didn't use these exact words, but you said that there are so many instances where you feel a bit noobed because you made a statement, you know, your intuition is you're a bit skeptical, you said they're stochastic parrots, and then you got proven wrong a bunch of times. And it's the same for me. Now, one school of thought is, you know, Rich Sutton: you just throw more data and compute at the thing. And the other school of thought is that we need completely different methods. I mean, are you still amenable to the idea that just scaling these things up will do the kinds of reasoning that we're talking about?

Possibly. Yeah. Right. So there are some people I feel like who have good visions of what the future might look like. And then there are people like me who just look at what the world looks like and then try to say, well, let's just do interesting work here. I feel like this works for me because for security in particular.

It really only matters if people are doing the thing to attack the thing. And so I'm fine just saying like, let's look at what is true about the world and write the security papers. And then if the world significantly changes, we can try and change. And we can try and be a couple years ahead looking where things are going so that we can do security ahead of when we need to. But I tend, because of the area that I'm in, not to spend a lot of time trying to think about like, where are things going to be?

in the far future. I think a lot of people try to do this, and some of them are good at it and some of them are not, and I have no evidence that I'm good at it. So I try and mostly reason based on what I can observe right now, and if what I can observe changes, then I ought to change what I'm thinking about these things and do things differently. And that's the best that I can hope for.

On this chess thing, has anyone studied... you know, in the headers for the chess notation you could say this player had an Elo of 2500 or something like that, and I guess the first thing is, do you see some commensurate change in performance? But what would happen if you said Elo 4000? Right. Yes. We've actually trained some models trying to do this. It doesn't work very well. It's like, you can't trivially do it,

at least, yeah, not if you just change the number. We've trained some models ourselves on headers that we expected would have an even better chance of doing this, and it did not directly give this kind of immediate win. Which, again, is not to say much; I am not good at training models, and someone else who knows what they're doing might have been able to make it have this behavior. But when we trained it, and when we tested 3.5 Turbo Instruct, it might have a statistically significant difference on the outcome,

but it's nowhere near the case that you tell the model it's playing like a 1000-rated player and all of a sudden it's 1000 rated. People have worked very hard to try and train models that will let you match the skill to an arbitrary level. And it's, like, a research-paper-level thing, not just change three numbers in the header and hope for the best. Right. So you wrote another article called Why I Attack.

Sure. And you said that you enjoy attacking systems for the fun of solving puzzles rather than for altruistic reasons. Can you tell me more about that? But also, why did you write that article? Yeah, okay. Okay, so let me answer them in the opposite order you asked them. So why did I write the article? Some people were mad at me for breaking defenses.

They said that I don't care about humanity. I just, I don't know, want to make them look bad or something. And half of that statement is true. I don't do security because I'm not driven by I want to do maximum good and therefore I'm going to think about what are all of the careers that I could do and try and find the one that's most likely to save the most lives.

You know, if I had done that, I probably would, I don't know, be a doctor or something that, you know, actually immediately helps people, or do research on cancer, like, find whatever domain you wanted where you could measure maximum good. I don't find those things fun. I can't motivate myself to do them. And so if I were a different person, maybe I...

Maybe I could do that. Maybe I could be someone who like could meaningfully solve challenging problems in biology by saying like, I'm waking up every morning knowing that I'm sort of like saving lives or something. But this is not how I work. And I feel like it's not how lots of people work. You know, there are lots of people who I feel like are in computer science and or you want to go even further in like quant fields where like you're clearly brilliant and you could.

be doing something a lot better with your life. And some of them probably legitimately just would just have zero productivity if they were doing something that they just really did not find any enjoyment in. And so I feel like the thing that I try and do is, okay, find the set of things that you can motivate yourself to do and like will do a really good job in and then solve those as good as possible.

subject to the constraint that you're actually net positive, moving things forwards. And for whatever reason, I've always enjoyed attacking things, and I feel like I'm differentially much better at that than at anything else. Like, I feel like I'm pretty good at doing the adversarial machine learning stuff, but I have no evidence that I would be at all good at the other, you know, 90% of things that exist in the world that might do better. And so

I don't know. Maybe one sentence for how I think about this is: it's how good you are at the thing multiplied by how much the thing matters, and you're trying to maximize that product. And if there's something that you're really good at that at least directionally moves things in the right direction, you can have a higher impact than taking whatever field happens to be the one that is maximally good and moving things forwards by a very small amount.

And so that's why I do attacks is because I feel like generally they move things forward and I feel like I'm better than most other things that I could be doing. Now, you also said that attacking is often easier than defending. Certainly. Tell me more. I mean, it's the standard thing in security. You need to find one attack that works and you need to fix all of the attacks if you're defending. And so if you're attacking something,

The only thing that I have to do is find one place where you've forgotten to handle some corner case. And I can arrange for the adversary to hit that as many times as they need until they succeed. This is why you have normal software security: you can have a perfect program everywhere except one line of code where you forget to check the bounds exactly once.

And what does this mean? The attacker will make it so that that happens every single time, and the security of your product is essentially zero. Under random settings, this is never going to happen. It's never going to happen that the hash of the file is exactly, you know, equal to 2 to the 32, which overflows the integer, which causes the bad stuff to happen. This is not going to happen by random chance, but the attacker can just arrange for this to happen every time, which means that it's much easier for the attacker than for the defender, who has to fix all of the things.

And then in machine learning, it gets even worse because at least in normal security and software security or other areas, we understand the classes of attacks. In machine learning, we just constantly discover new categories of bad things that could happen. And so not only do you have to be robust to the things that we know about, you have to be robust to someone coming up with a new clever type of attack that we hadn't even thought of before and be robust there. And this is not happening because of...

I mean, it's a very new field. And so, of course, it's just much easier for these attacks than defenses. Let's talk about disclosure norms. How should they change now that we're in the ML world? Okay, yeah. So in standard software security, we've basically figured out how things should go. So for a very long time, you know, for 20 years, there was a big back and forth between...

When someone finds a bug in some software that can be exploited, what should they do? And, let's say, I don't know, late 90s, early 2000s, there were people who were on the full-disclosure side, where they thought: I find a bug in some program, what should I do? I should tell it to everyone, so that we can make sure that people don't make a similar mistake, and we can put pressure on the person to fix it, and do all that stuff. And then there were the people who were on the

don't-disclose-anything side: you should report the bug to the person who's responsible and wait until they fix it, and then you should tell no one about it, because, you know, this was a bug that they made and you don't want to give anyone else ideas for how to exploit it. And in software security, we landed on, you know, what was called responsible disclosure and is now coordinated disclosure, which is the idea that you should give the person it affects a reasonable heads up for some amount of time.

Google Project Zero has a 90-day policy, for example. And you have that many days to fix your thing. And then after that, or once it's fixed, then it gets published to everyone. And the idea here in normal security is that you give the person some time to protect their users. You don't want to immediately disclose a new attack that allows people to cause a lot of harm. But...

You put a deadline on it and you stick to the deadline to put pressure on the company to actually fix the thing. Because what often happens if you don't say you're going to release things publicly is no one else knows about it. You're the only one who knows the exploit. They're just going to not do it because they're in the business of making a product, not fixing bugs.

And so why would they fix it if no one else knows about it? And so when you say, like, no, this will go live in 90 days, you better fix it before then. They have the time. It's just, like, now if they don't do it, it's on them, because they just didn't put in the work to fix the thing. And there are, of course, exceptions. You know, Spectre and Meltdown are two of the best-known exploits, like, some of the biggest attacks in the last 10, 20 years in software security. And they gave Intel and related people a year

to fix this because it was a really important bug. It was a hard bug to fix. There were legitimate reasons why you should do this. There's good evidence that it's probably not going to be independently discovered by the bad people for a very long time. And so they gave them a long time to fix it. And similarly, Google Project Zero also says if they find evidence the bug is being actively exploited, they'll give you seven days. If there's someone actually exploiting it, then you have seven days before they'll patch because the harm is already being done.

And so they might as well tell everyone about the harms being done because if they don't, then it's like just going to delay the things. Okay. So with that long preamble, how should things change for machine learning? The short answer is, I don't know, because on one hand, I want to say that this is like how things are in software security. And sometimes it is where, you know, someone has some, some bug in their software and there exists a way that they can patch it and fix the problems.

And in many cases, this happens. So we've written papers recently, for example, where we've shown how to do some model stealing stuff. So OpenAI has a model, and we could query OpenAI's services in a way that allowed us to steal part of their model. Only a very small part, but we could steal part of it. So we disclosed this to them, because there was a way that they could fix it: they could make a change to the API to prevent this attack from working. And then we write the paper and put it online.
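A toy sketch of the core observation behind that paper, simulated locally with random matrices rather than real API queries: because final-layer logits are a linear function of the hidden state, logit vectors collected over many prompts lie in a low-dimensional subspace, and the rank of that subspace reveals the model's hidden dimension.

    # Sketch: logits = W @ h, so a matrix of logit vectors has rank equal to the
    # hidden size. Everything here is simulated; no service is queried.
    import numpy as np

    vocab, hidden, n_queries = 5000, 64, 256
    rng = np.random.default_rng(0)
    W = rng.normal(size=(vocab, hidden))       # stand-in for the output projection
    H = rng.normal(size=(hidden, n_queries))   # hidden states for many prompts
    logits = W @ H                             # what an API exposing full logits would return

    s = np.linalg.svd(logits, compute_uv=False)
    estimated_hidden = int((s > 1e-6 * s[0]).sum())  # spectrum collapses after `hidden` values
    print(estimated_hidden)                          # prints 64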

This feels very much like software security. On the other hand, there are some other kinds of problems that are not the kinds that you can patch. Let's think in the broadest sense, adversarial examples. If I disclosed to you, here is an adversarial example on your image classifier. What is the point of doing the responsible disclosure period here? Because there is nothing you can do.

to fix this in the short term. We have been trying to solve this problem for 10 years. Another 90 days is not going to help you at all. Maybe I'll tell you out of courtesy to let you know, this is the thing that I'm doing. I'm going to write this paper. Here's how I'm going to describe it. Do you want to put in place a couple of filters ahead of time to make this particular attack not work? But you're not going to solve the underlying problem. And when I talk to people who do biology things, the argument they make is, you know,

Suppose someone came up with a way to create some novel pathogen or something. A disclosure period doesn't help you here. And so is it more like that or is it more like software security? I don't know. I'm more biased a little bit towards the software security because that's where I came from. But it's hard to say exactly which one we should be modeling things after. I think we do probably need to come up with new norms for how we handle this. There are a lot of people I know who are talking about this, trying to write these things down. And I think...

In a year or two, if you ask me this again, we will have set processes in place. We will have established norms for how to handle these things now. I think this is just like very early. And right now we're just looking for analogies in other areas and trying to come up with what sounds most likely to be good. But I don't have a good answer for you. Yeah, immediately now. Are there any vulnerabilities that you've decided not to pursue for ethical reasons?

No, not that I can think of. But I think mostly because I tend to only try and think of the exploits that would be ethical in the first place. It may happen that I stumble upon this, but I tend to... I think research ideas, in some very small fraction of the time, research ideas happen just by...

random inspiration. Most of the time, though, research ideas are not something that just happens; you have to spend conscious effort trying to figure out what new thing you're going to try and do. And I think it's pretty easy to just not think about the things that seem morally fraught, and just focus on the ones that seem like they actually have potential to be good and useful. But

It very well may happen at some point that this is something that happens, but this is not a thing that I... I can't think of any examples of attacks that we've found that we've decided not to publish because of the harms that they would cause. But I can't rule out that this is something that would happen; I just tend to bias my search of problems in the direction of

things that I think are actually beneficial. I mean, maybe going back to the Why I Attack thing: you want the product of how good you are and, you know, how much good it does for humanity to be maximally positive. You can choose what problems you work on to not be the ones that are negative. And so, like, you know, I don't have lots of respect for people where the direction of the goodness for the world is just a negative number, because you can choose to make that at the very least zero: just don't do anything.

And so, you know, I try and pick the problems that I think are generally positive. And then among those, yeah, just do as good as possible on those ones. So you work on traditional security and ML security. What are the significant differences? Yeah. Okay. So I don't work too much on traditional security anymore. I started my PhD in traditional security. I did very, very low-level return-oriented programming stuff; I was at Intel for a summer on some hardware-level defense stuff.

And then I started machine learning shortly after that. So I haven't worked on very traditional security in, let's say, the last seven or eight years. But yeah, I still follow it very closely. I still go to the systems security conferences all the time, because I think it's a great community. But yeah, what are the similarities and differences? I feel like the systems security people are very good at

really trying to make sure that what they're doing is a very rigorous thing, and evaluating it really thoroughly and properly. You know, you see this even in the length of the papers. So a systems security paper is like 13, 14 pages long, two-column.

A paper that's a submission for ICLR is like seven or eight or something, one column. You know, the systems security papers will all start with a very long explanation of exactly what's happening; the results are expected to be really rigorously done. A machine learning paper often is: here is a new cool idea, maybe it works. And this is good for, you know, move fast and break things; this is not good for really systematic studies. You know, when I was doing systems security papers, I would get, like, one, one and a half, two a year.

And now, for a similar kind of machine learning paper, you could probably do five or six or something to the same level of rigor. And so I feel like this is maybe the biggest thing I see in my mind: the level of thought that goes into some of these things. It's a conscious decision for the communities, right? And I think it's worked empirically in the machine learning space. It would not be good if every research result in machine learning needed to have the kind of rigor you would have expected

for a systems paper, because we would have had like five iteration cycles in total. And, you know, at machine learning conferences you often see the paper, the paper that improved upon the paper, and the paper that improved upon that paper, all at the same conference, because the first person put it on arXiv, and the next person found the tweak that made it better, and the third person found the tweak that made it even better. And this is good: you know, when the field is very new, you want to allow people to rapidly propose ideas that they don't

have full evidence of everything that's working. And when it feels much more mature, you want to make sure that you don't have people just proposing wild things that have been proposed 30 times in the past and they don't know that it works. And so I think, yeah, having some kind of balance and mix between the two is useful. And this, I think, is maybe the biggest difference that I see. And this is, I guess, maybe if there's some differential advantage that I have in the machine learning space, I think some of it comes from this where in systems I...

was trained very heavily on this kind of rigorous thinking and how to do attacks very thoroughly: look at all of the details. And when you're doing security, this is what you need to do. And so I think some of this training has been very beneficial for me in writing machine learning papers, thinking about all of the little details to get these points right. Because I had a paper recently where the way that I broke some defense, the way that the thing broke, is because there was a negative sign in the wrong spot.

And this is not the kind of thing I could have reasoned about from first principles about the code. If I had been advising someone, I don't know how I would have told them, check all the negative signs. You don't know. What you should be doing is understanding everything that's going on and finding the one part where the mistake was made, so that you can break it by doing just the one right thing. And so this is maybe my biggest difference, I think, between these communities. Next article.

It was called Why I Use AI, and you wrote it a couple of months ago. And you say that you've been using language models, you find them very useful, and they improve your programming productivity by about 50%. I can say the same myself. Maybe let's start there. I mean, can you break down specifically the kind of tasks where it's really uplifted your productivity? So I am not someone who believes in these kinds of things.

You know, I don't. There are some people whose job is to hype things up and whose job is to get attention on these kinds of things. And I feel like the thing that I was annoyed about is that these people...

The same people who were, you know, Bitcoin is going to change the world, whatever, whatever. As soon as language models come about, they all go, language models are going to change the world. They're very useful, whatever, whatever. And the problem is that if you're just looking at this from afar, it looks like you have the people who are the grifters just finding the new thing. And they are, right? Like this is what they're doing. These people have no understanding what's going on in the world. They're trying to find whatever the new thing is that they can get them clicks.

But at the same time, I think that the models that we have now are actually useful. And they're not useful for nearly as many things as people like to say that they are. But for a particular kind of person, the person who understands what is going on in these models and knows how to code and can review the output, they're useful. And so what I wanted to say is like, I'm not going to try and argue that

they're good for everyone, but I want to say like, here's an N equals one me anecdote that I think they're useful for me. And if you have a background similar to me, then maybe they're useful for you too. And, you know, I've got a number of people who are like, you know, security style people who have contacted me and said like, you know, thanks for writing this. Like, you know, they have been useful for me. And yeah, now there was a question of, does my experience generalize to anyone else?

I don't know. This is not my job, to try and understand this. But at least what I wanted to say was, yeah, they're useful for people who behave like I do. Okay, now, why are they useful? The current models we have now are good enough at the kinds of things where I want an answer to a question,

whether it's write this function for me or whatever, do this. Like, I know how to check it. I know that I could get the answer; it's something I know how to do, I just don't want to do it. The analogy I think is maybe most useful is: imagine that you had to write all of your programs in C or in assembly. Would this make it so that you couldn't do anything that you can do now? No, probably not. Like, you could do

all of the same research results in C instead of Python, if you really had to. It would take you a lot longer, because you have an idea in your mind you want to implement, you know, let's say something trivial, some binary search thing. And then in C, you have to start reasoning about pointers and memory allocation and all these little details that are at a much lower level than the problem you want to solve. And the thing I think is useful about language models

is that if you know the problem you want to solve and you can check that the answer is right, then you can just ask the model to implement the thing that you want, in whatever words you want to type, which are not terribly well-defined. And then it will give you the answer, and you can just check that it's correct, put it in your code, and then continue solving the problem you want to be solving, and not

the problem of actually typing out all the details. That's maybe the biggest class of things that I find useful. And the other class of things I find useful are the cases where you rely on the fact that the model has just enormous knowledge about the world and about all kinds of things. And if you understand the fundamentals, but, like, I don't know the API to this thing, just

make the thing work with the API, and I can check that easily. Or, you know, I don't understand how to write something in some particular language: give me the code. If you give me code in any language, even if I've never seen it before, I can basically reason about what it's doing. I may make mistakes around the border, but I could never have typed it, because I don't know the syntax, whatever. The models are very good at giving you the correct syntax

and just getting everything else out of the way, and then I can figure out the rest about how to do this. And, you know, if I couldn't ask the model, I would have had to learn the syntax of the language to type out all the things, or do what people would do, you know, five years ago: copy and paste some other person's code from Stack Overflow and, you know, make annotations. And that was just a strictly worse version of just asking the model, because now I'm relying on me, who doesn't know anything, to just, you know, do copy and paste. And so, you know, this is, I guess, my...

My view is that for these kinds of problems that they're currently plenty useful. If you already understand, and by that, I mean an abstract understanding, then they're a superpower, which explains why, you know, like the smarter you are, actually, the more you can get out of a language model. But how has your usage evolved over time? And just what's your methodology? I mean, you know, speaking personally, I know that specificity is important. So going to source material and constructing the prompt, you know, imbuing my understanding and reasoning process into the prompt. I mean, how do you think about that? Yeah.

I guess I try and ask questions that I think have a reasonable probability of working. And I don't ask questions where I feel like this was going to slow me down. But if I think it has, you know, a 50% chance of working, I'll ask the model first. And then I'll look at the output and see like, does this directionally look correct? And if it seems like

directionally, it maybe is going to approach the correct kind of solution, then I might iterate a little more. And if it gives me a perfect solution the first time, then great, I accept it. And if not, then I learn, okay, the models are not very good at this kind of problem; I just won't ask that again in the future. And so I feel like, for some people who say they can't get models to do anything useful for them, it may be the case that models are just really bad at your particular kind of problem. It may also just be that you don't have a good understanding of what the models can do yet.

I think most people today have forgotten how much they had to learn about how to use Google search. People today, if I tell you to look something up, you implicitly know the way that you should look something up is to use the words that appear in the answer. You don't ask it as the form of a question. There's a way that you type things into search engines to get the right answer.

This requires some amount of skill and understanding about how to reliably find answers to something online. I feel like it's the same thing for language models. They have a natural language interface. So like technically you could type whatever thing that you wanted. There are some ways of doing it that are much more useful than others. And I don't know how to teach this as a skill other than just saying like, try the thing. And maybe it turns out they're not good at your task and then just don't use them. But if you are...

able to make them useful, then this seems like a free productivity win. But again, the caveat is that you have to have some understanding of what's actually going on with these things, because there are people who don't, who will try to do the same kinds of things, and then I'm worried:

Are you going to learn anything? You won't catch the bugs when the bugs happen. There are all kinds of problems I'm worried about from that perspective. But for the practitioner who wants to get work done, in the same way that I wouldn't say you need to use C over Python, I wouldn't say you need to use plain Python over Python plus language models. Yes, yes. I agree that laziness and acquiescence is a problem.

Vibes and intuition are really important. I mean, I consider myself a Jedi of using LLMs, and sometimes it frustrates me, because I say to people, oh, just use it like this, and I seem to be able to get so much more out of LLMs than other people, and I'm not entirely sure why that is. Maybe it's just because I understand the thing that I'm prompting, or something like that. But it seems to be something that we need to learn. Yeah, I mean, every time a new tool comes along, you have to spend some time. I remember when people would say,

Real programmers write code in C and don't write it in a high-level language. Why would you trust the garbage collector to do a good job? Real programmers manage their own memory. Real programmers write their own Python. Why would you trust the language model to output code that's correct? Why would you trust it to be able to have this recall? Real programmers understand the API and don't need to look up the reference manual. You can draw the same analogies here. I think this is the case of...

when the tools change and make it possible for you to be more productive in certain settings, you should be willing to look into the new tools. I know, I'm always trying to rationalize this, because it comes down to this notion of: is the intelligence in the eye of the prompter? Does it matter? I guess this is maybe the difference between how I use these things and how other people do: the thing

makes me more productive and solves the task for me. Was it the case that I put the intelligence in? Maybe. In many cases I think the answer is no; in some cases I think the answer is yes. But I'm not going to look at it that way. I'm going to look at it as: is it solving the questions that I want, in a way that's useful for me? I think here the answer is definitely yes. But

Yeah, I don't know how to answer this in some real way. So, you're obviously a security researcher. How does that influence the way that you use LLMs?

Oh, yeah. No, this is why I'm scared about the people who are going to use them and not understand things. Because you ask them to write an encryption function for you, and the answer really ought to be: you should not do that, you should be calling this existing API. And oftentimes they'll say, sure, you want me to write an encryption function? Here's an encryption function. And it's going to have all of the bugs that everyone normally writes, and this is going to be terrible.
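To make that "call the existing API" point concrete, here is a minimal sketch of the kind of vetted primitive being gestured at, written in Python against the widely used cryptography package's Fernet interface. This is an editorial illustration, not code from the conversation; the key handling and payload are purely illustrative.

    # A minimal sketch of "call a vetted API instead of writing your own crypto".
    # Requires the third-party cryptography package (pip install cryptography).
    from cryptography.fernet import Fernet

    key = Fernet.generate_key()                 # in real use, store and manage this key carefully
    f = Fernet(key)

    token = f.encrypt(b"some secret payload")   # authenticated encryption in one call
    assert f.decrypt(token) == b"some secret payload"   # decrypt() rejects tampered tokens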

The same thing happened with some code I was writing that made calls to a database. What did the model do? It wrote the thing that was vulnerable to SQL injection. And this is terrible: if someone was not being careful, they would not have caught this, and now they've introduced all kinds of bad bugs. Because I'm reasonably competent at programming, I can read the output of the model and just correct the places where it made these mistakes. It's not hard to fix the SQL injection and replace the string concatenation with a parameterized query; the model just didn't do it correctly.
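Concretely, the fix being described is swapping string concatenation for a parameterized (templated) query. Below is a minimal, self-contained sketch using Python's standard sqlite3 module; the table, columns, and input values are hypothetical and only there to show the pattern.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (name TEXT, email TEXT)")
    conn.execute("INSERT INTO users VALUES ('alice', 'alice@example.com')")

    user_input = "alice' OR '1'='1"   # hostile input a careless caller might pass through

    # Vulnerable pattern: concatenation lets the input rewrite the query itself.
    vulnerable = "SELECT email FROM users WHERE name = '" + user_input + "'"
    print(conn.execute(vulnerable).fetchall())   # returns every row in the table

    # Fixed pattern: a parameterized query treats the input strictly as data.
    safe = "SELECT email FROM users WHERE name = ?"
    print(conn.execute(safe, (user_input,)).fetchall())   # returns nothing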

And yeah, so I'm very worried about the kind of person who's not going to do this. There have been a couple of papers showing that people do write very insecure code when using language models, when they're not being careful about these things. And this is something I'm worried about: it looks like it might be the case that code is differentially more vulnerable when people use language models than when they don't. And that, I think, is a big concern.

I tend to think about this utility question from the perspective that the security of things people actually use matters. And so I want to know what people are going to do, so that you can then write the papers and study what people are actually going to do. So I feel like it's important to separate out: can the model solve the problem for me? And the answer, for language models, is oftentimes yes, it gives you the right answer for the common case. And this means most people don't care about the security question.

And so they'll just use the thing anyway, because it gave them the ability to do this new thing, without understanding the security piece. And that means we should then go and do the security work around this other question: we know people are going to use these things, so we ought to make sure the security is there so that they can use them correctly.

And so I often try to use things that are at the frontier of what people are going to do next, just to put myself in their frame of mind and understand this. And this worries me quite a lot, because things could go very bad here. How and when do you verify the outputs of LLMs? The same way you verify anything else. I mean, this is the other thing. People say, maybe the model is going to be wrong, but half of the answers on Stack Overflow are wrong anyway.

So this isn't new: if you've been programming for a long time, you're used to the fact that you read code that's wrong. I'm not going to copy and paste some function from Stack Overflow and just assume that it's right, because maybe the person asked a question that was different from the question that I asked. Whatever. I don't feel like I'm doing anything terribly different when I'm verifying language model code versus verifying some function that I found that someone else wrote online.

Maybe the only difference is that I'm using the models more often, and so I have to be more careful about checking: if you're using something twice as often, and it introduces bugs at some rate, you're going to have twice as many bugs when you use it twice as much. So you have to be a little more careful. But I don't feel like there's anything I'm doing that's especially different in kind. It's just: don't trust the thing to give you the right answer, and understand the fact that

95% solutions are still 95% solutions. You take the thing, it does almost everything you wanted, and then it has hit the limit of its capability. That's fine. You're an intelligent person. Now you finish the last 5%, fix whatever the problem is, and there you have your 20x performance increase. Yeah, you touched on something very interesting here, because actually most of us are wrong most of the time,

and that's why it's really good to have at least one very smart friend, because they constantly point out all of the ways in which your stuff is wrong. Most code is wrong. I mean, it's your job to point out how things are wrong, and I guess we're always just kind of on the boundary of wrongness, unwittingly, and that's just the way the world works anyway. Yeah, right. And so I think

there's a potential for a massive increase in the quantity of wrongness with language models. There are lots of things that could go very wrong or very bad with language models. Previously, the amount of bad code that could be written was limited by the number of humans who could write bad code, because there are only so many people who can write software, and you had to have at least some training. So there was some bounded amount of bad code.

One of the other things I'm worried about is: you have people who look at these models and say they can solve all your problems for you, and now you have ten times as much code. Which is great from one perspective, because isn't it fantastic that anyone in the world can go and write whatever software they need to solve their particular problem? That's fantastic. But at the same time, the security person in me is kind of scared about this, because now you have ten times as much stuff that is probably very insecure.

And you're not going to have ten times as many security experts to study all of it. You're going to have a massive increase in this, in some potential futures. And yeah, this is one of the many things I'm worried about, and it's why I try to use these things to understand: does this seem like something people will try to do? It seems to me the answer is yes right now. And yeah, this worries me.

So I spoke with some Google guys yesterday, and they've been studying some of the failure modes of LLMs. Just really crazy stuff that people don't know about: they can't copy, they can't count, because of the softmax and the topological representation squashing, loads and loads of stuff they can't do. In your experience, have you noticed some kinds of tasks that LLMs just really struggle on? I'm sure that there are many of them. I have sort of learned to just not ask those questions.

And so I have a hard time coming up with them, in the same sense that: what are the things that search engines are bad for? I'm sure there are a million things that search engines give completely the wrong answer for, but if I pressed you to answer that right now, you'd have a little bit of a hard time, because the way that you use them is for the things that they're good at. So, yes: whenever you want correctness in some strong sense, the model is not the thing for you.

But in terms of specific tasks that they're particularly bad at: of course you can say that anything that would take you more than, you know, 20 minutes to write as a program, the model probably can't get. But the problem with that is that it's changing. So, okay, this is the other thing: there are things that I thought would be hard

that end up becoming easier. There was a random problem that I wanted to solve for unrelated reasons. It's a hard dynamic programming problem; it took me, I don't know, two or three hours to solve it the first time I had to do it. And o1 had just launched a couple of days earlier. I gave the problem to o1, and in like two minutes it gave me an implementation that was ten times faster than the one I wrote.

And I can test it, because I have a reference solution, and it's correct. So now I've learned: here's a thing I previously would never have asked a model to solve, because it was a challenging enough algorithmic problem for me that I would have had no hope of the model solving it, and now I can. But there are other things that seem trivial to me that the models get wrong, and I've mostly just

not asked those questions. But yeah, going back to the thing I'm worried about: I worry people will not have the experience to check when the answers are right and wrong, and they'll just apply the wrong answer as many times as they can, and that seems concerning. Yeah, I mean, this is part of the anthropomorphization process, because I find it fascinating that we have vibes, we have intuitions, and we actually know,

and we've learned to skirt around the failure modes, the long tail of failure modes, and we just smooth them over in our supervised usage of language models. And the amazing thing is we don't seem to be consciously aware of it. Yeah, but programmers do this all the time, right? You have a language, and the language has some... okay, so let's suppose you're someone who writes Rust. Rust has a very, very weird model of memory.

If you go to someone who's very good at writing Rust, they will structure the program differently, so they don't encounter all of the problems that come from that memory model. But if I were to do it... I'm not very good at Rust. I try to use it, and I try to write my C code in Rust, and the borrow checker just yells at me to no end and I can't write my program. And I look at Rust and go: I see that this could be very good, but I just don't know how to get my code right, because I haven't done it enough. And so

I look at the language and go: okay, if I were not being charitable, I would say, why would anyone use this? It's impossible to write my C code in Rust. I'm supposed to get all these nice guarantees, but no: you have to change the way you write your code, change your frame of mind, and then the problems all just go away. You can do all of the nice things; just accept the paradigm you're supposed to be operating in, and the thing goes very well. I see the same kind of analogy for some of these things here, where

The models are not very good in certain ways. And you're trying to imagine that the thing is a human and ask it the things you would ask another person, but it's not. And you need to ask it in the right way, ask the right kinds of questions, and then you can get the value. And if you don't do this, then you'll end up very disappointed because it's not superhuman. What are your thoughts on benchmarks? Okay, yes, I have thoughts here.

This, I guess, is the problem with language models: we used to be in a world where benchmarking was very easy, because we wanted models to solve exactly one task. So what you do is you measure the model on that task, and you see: can it solve the task? And the answer is yes, and so great, you've figured it out. The problem is that that task was never the task we actually cared about, and this is why no one used the models.

No ImageNet models ever made it out into the real world to solve actual problems, because we just don't care about classifying between 200 different breeds of dogs. The model may be good at this, but it's not the thing we actually want; we want something different. And it would have been absurd at the time to complain that the ImageNet model can't solve this actual task I care about in the real world, because of course it wasn't trained for that. Language models are different.

The claim that people make for language models, and that the people who train them make, is: I'm going to train this one general-purpose model that can solve arbitrary tasks. And then they'll go test it on some small number of tasks and say, see, it's good, because it can solve these tasks very well. And the challenge here is that if I trained a model to solve any one of those tasks in particular, I could probably get really good scores. The challenge is that

you don't want the person who has trained the model to have done that. You wanted them to just train a good model, and then use the benchmark as something independent: here's a task you could evaluate the model on, completely independent of the initial training objective, in order to get an unbiased view of how well the model does. But people who train models are incentivized to make them do well on benchmarks. And while in the old world,

you know, I trust researchers not to cheat. Suppose I wanted maximum ImageNet test accuracy: in principle, I could have trained on the test set. That is actually cheating; you don't train on the test set, and I trust that people don't do this. But suppose that I give you a language model and I want to evaluate it on, say, coding, for which I'm going to use, you know,

HumanEval, a terrible benchmark, but whatever. Or I'm going to use MMLU, I'm going to use MMMU, whatever the benchmarks may be. I may not literally train the model on the test set of these things, but I may train my model specifically to be good on these benchmarks. And so you may have a model that is not very capable in general, but on these specific 20 benchmarks that people use, it's fantastic. And

this is what everyone is incentivized to do, because you want your model to have maximum scores on benchmarks. And so I would like to be in a world where there were a lot more benchmarks, so that this is not the kind of thing you can easily do, and you can more easily trust that these models are going to give you the right answers, that they accurately reflect what their skill level is, in some way that is not

being designed by the model trainer to maximize the scores. So at the moment, you know, the hyperscalers put incredible amounts of work into benchmarking and so on. And now we're moving to a world where we've got test-time inference, test-time active fine-tuning; people are fine-tuning, quantizing, fragmenting models, and so on, and a lot of the people doing this in a practical sense can't really benchmark in the same way. How do you see that playing out? Okay, that I don't know. I feel like,

if you're doing quantization and stuff, good luck. I don't know; it just seems very hard to test what these things are. You can run the standard benchmarks and hope for the best, but I don't know. I feel like the thing I'm more worried about is people who are actively fine-tuning models to show that they can make them better on certain tasks. So you have lots of fine-tunes of Llama, for example, that are claimed to be better.

And they'll show all the benchmark numbers. And it just turns out that what they did was they just really trained the models to be good on these specific tasks. And if you ask them anything else, they're just really bad. I think that's the thing I'm more worried about. But yeah, for other cases, I don't know. I agree this is hard, but I don't have any great solutions here. That's okay. We can't let you go before talking about one of your actual papers. I mean, this has been amazing talking about general stuff.

I decided to pick this one, "Stealing Part of a Production Language Model." This is from July. Could you just give us a bit of an elevator pitch on that? For a very long time, when we did papers in security, what we did was we would think about how a model might be used in some hypothetical future and then say, well, maybe we have

certain kinds of attacks that are possible; let's try to show, in some theoretical setting, that this is something bad that could happen. And so there's a line of work called model stealing, which tries to answer the question: can someone steal a copy of your model just by making standard queries to your API?

This was started by Florian Tramèr and others in 2016, where they did this on very, very simple linear models behind APIs. And then it became a thing that people started studying on deep neural networks, and there were several papers in a row by a bunch of other people. And then in 2020, we wrote a paper, which we published at CRYPTO, that said: well, here is a way to steal an exact copy of your model.

Whatever your model is, I can get an exact copy, as long as a long list of assumptions holds: it only uses ReLU activations; the whole thing is evaluated in 64-bit floating point; I can feed float64 values in and see float64 values out; the model is only fully connected; its depth is no greater than three; it's no more than 32 units wide on any given layer. A long list of things that are

never true in practice. But it's a very cool theoretical result, and there are other papers of this kind that show how to steal an exact copy of your model, but only in these really contrived settings. This is why we submitted the paper to CRYPTO: they have all these kinds of theoretical results that are very cool but not immediately practical in many ways. And then there was a line of work continuing and extending upon this.

And the question that I wanted to answer is: now we have these language models, and if I list all of those assumptions, all of them are false. It's not ReLU-only activations. It's not just fully connected. I can't send float64 inputs; I can't view float64 outputs. There are billions of neurons, not 500. None of those assumptions hold. And so I wanted to answer: what's the best attack that we can come up with

that I can actually implement in practice against a real API? And so this is what we tried to do: come up with the best attack that works against the most real API that we have. So we looked at the OpenAI API, and some other companies; Google had the same kind of thing. And because of the way the API was set up, it allowed us to get some degree of control over the outputs

that let us do some fancy math to steal one layer of the model. Among the layers in the model, it's probably the least interesting, and it's a very small amount of data, but I can actually recover one of the layers of the model. So it's real in the sense that I can do it, and it's real in the sense that I recover the layer correctly. But it's not everything. And so I think what I was trying to advocate for in this paper is

that we should be pursuing both directions of research at the same time. One is to write the papers that are true in some theoretical sense but are not the kinds of results you can actually implement in any real system, and likely won't be for the foreseeable future. And at the same time, do the thing that most security researchers do today, which is to look at systems as they're deployed and ask:

given this system as it exists right now, what are the attacks you can actually really get to work against the model, and try to write papers on those pieces of it. And I don't know what you're going to do with the last layer of the model; we have some things you can do. But one thing it tells you is the width of the model, which is not something that people disclose. So in our paper, we have, I think, the first

public confirmation of the width of the GPT-3 Ada and Babbage models, which is not something that OpenAI ever said publicly. The GPT-3 paper gave the widths of a couple of models, but they never directly said what the sizes of Ada and Babbage were. People speculated, but we could actually write it down and confirm it. As part of the paper, we also ran the attack on GPT-3.5,

and we correctly stole the last layer, and I know the size of the model, and it is correct. Which goes back to responsible disclosure, like we talked about at the beginning: we agreed with them ahead of time that we were going to do this.

This is a fun conversation to have with not only Google lawyers but OpenAI lawyers: hi, I would like to steal your model, may I please do this? The OpenAI people were very nice and said yes. The Google lawyers were initially very much: you would like to steal OpenAI's data? Under no circumstances. But I said, if I get OpenAI's general counsel to agree, are you okay with that? They said, sure. We put it on an isolated VM, we ran everything, we destroyed the data afterwards, whatever. But

part of the agreement was that they would confirm that we got it right, that we did the thing correctly, but they asked us not to release the actual data we stole, which makes sense, right? You want to show here's an attack that works, but not actually release the stolen stuff. And so, you know, if you were to write down a list of all the people in the world who know how big GPT-3.5 is, the list includes all current and former employees of OpenAI,

and me. And so it sounds like this is a very real attack, because this is the easiest way. How else would you learn this? The other way would be to hack into OpenAI's servers and try to learn it there, or to blackmail one of the employees. Or you can do an actual adversarial machine learning attack and recover the size of those models and the last layer. And so that's the motivation behind why we wanted to write this paper: to get

examples, and to encourage other people to get examples, of attacks that, even if they don't solve all of the problems, let us make them increasingly real in this sense. And I think this is something we'll need to see more of as we start to get

systems deployed into more and more settings. And so that was why we did the paper. I don't know if you want to talk about the technical methods behind how we did it or something, but... yeah, do you want to go there? Okay, sure, I can try. Okay, so for the next two minutes, let's assume some level of linear algebra knowledge. If that's not you, then I apologize; I will try to explain it in a way that makes some sense.

So the way these models work is they have a sequence of layers, and each layer is a transformation of the previous layer. The layers have some size, some width. And it turns out that the last layer of a model goes from a small dimension to a big dimension: the internal dimension of these models is, I don't know, let's say 2048 or something, and the output dimension is the number of tokens in the vocabulary, which is something like 50,000.

And so what this means is that if you look at the vectors that are the outputs of the model, even though they live in this giant 50,000-dimensional space, the vectors, because this was a linear transformation, actually lie in only a 2,000- or 4,000-dimensional subspace. And what this means is that if you look at this space, you can compute what's called the singular value decomposition to recover

how that subspace was embedded into the bigger space, and this directly... okay, I'll say a phrase: the number of non-zero singular values tells you the size of the model. Again, it's not challenging math; the last time I used this was as an undergrad in a math class. But if you work out the details, it ends up working, and yeah, this is exciting.
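As a rough sketch of that dimension-counting observation in Python with NumPy: if you could collect full logit vectors from the API, the number of significant singular values reveals the hidden width. This shows only the linear algebra idea, not the attack in the paper itself, which has to reconstruct those vectors through the constrained interface (logit bias and top log-probabilities); the sizes below are toy values so the sketch runs instantly.

    import numpy as np

    # Toy stand-in: the final layer maps a small hidden dimension h up to a large
    # vocabulary dimension v, so every logit vector lies in an h-dimensional
    # subspace of R^v. Real models have h in the low thousands and v around 50,000.
    h, v, n_queries = 64, 1000, 256
    rng = np.random.default_rng(0)

    W = rng.standard_normal((v, h))           # final (unembedding) projection
    H = rng.standard_normal((h, n_queries))   # hidden states the attacker never sees
    logits = W @ H                            # what an attacker could collect

    # The rank of the collected logits, i.e. the number of significant singular
    # values, recovers the hidden width h.
    s = np.linalg.svd(logits, compute_uv=False)
    recovered_h = int((s > s[0] * 1e-10).sum())
    print(recovered_h)                        # 64: the model's width leaks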

It's a very nice application of some nice math to these kinds of things. And I think part of the reason I like the details here is that it's the kind of thing that doesn't require being an expert in any one area. It's undergrad-level math; I could explain this to anyone who has completed a first course in linear algebra.

But you need to be that person, and you also need to understand how language models work, and you need to be thinking about the security, and you need to be thinking about what the actual API provides, because you can't get the standard stuff. You have to be thinking about all the pieces. This is why I think the paper is interesting; this is what a security person does. It's not that we're looking at any one thing more deeply than anyone else.

Sometimes you do look at one thing far deeper than anyone else, but most often the way these exploits happen is that you have a fairly broad level of knowledge, and you're looking at how the details of the API interact with how the specific architecture of language models is set up, using techniques from linear algebra. And if you were missing any one of those pieces, you wouldn't have seen that this attack was possible, which is why the OpenAI API had this for

three years and no one else found it first. They were not looking for this kind of thing. You don't stumble upon these kinds of vulnerabilities; you need people to actually go look for them. And then, again, responsible disclosure: we gave them 90 days to fix it, and they patched it. Google patched it. A couple of other companies, who we won't name because they asked us not to, patched it. And it works. And so that was a fun paper to write.

Amazing. Well, Nicholas Carlini, thank you so much for joining us today. It's been an honor having you on. Thank you.
