A Conversation with Michael Brown About Designing AI Systems

Aug 22, 2025 · 50 min

Episode description

In this episode of Unsupervised Learning, I sit down with Michael Brown, Principal Security Engineer at Trail of Bits, to dive deep into the design and lessons learned from the AI Cyber Challenge (AIxCC). Michael led the team behind Buttercup, an AI-driven system that secured 2nd place overall.

We discuss:

-The design philosophy behind Buttercup and how it blended deterministic systems with AI/ML 
-Why modular architectures and “best of both worlds” approaches outperform pure LLM-heavy designs
-How large language models performed in patch generation and fuzzing support
-The risks of compounding errors in AI pipelines — and how to avoid them
-Broader lessons for applying AI in cybersecurity and beyond

If you’re interested in AI, security engineering, or system design at scale, this conversation breaks down what worked, what didn’t, and where the field is heading.

Subscribe to the newsletter at:
https://danielmiessler.com/subscribe

Join the UL community at:
https://danielmiessler.com/upgrade

Follow on X:
https://x.com/danielmiessler

Follow on LinkedIn:
https://www.linkedin.com/in/danielmiessler

Become a Member: https://danielmiessler.com/upgrade

See omnystudio.com/listener for privacy information.

Transcript

S1

All right, Michael, welcome to Unsupervised Learning.

S2

Hey, it's great to be here. Thanks for having me.

S1

Yeah. So, uh, lots to talk about here. Uh, can you give a quick intro on yourself?

S2

Yeah, sure. So, uh, my name is Michael Brown. I'm a principal security engineer at Trail of Bits. I lead up our company's AI and ML security research group. We really focus on two kinds of, uh, intersections between AI, ML, and security. It's primarily using AI/ML technologies to solve traditional cybersecurity problems that are really hairy and really kind of sticky, and that conventional methods have kind of failed to address. And then we also, uh, to a smaller degree, look at, um,

the security of AI/ML-based systems. So, um, I was also the lead designer, um, and team lead for, um, Trail of Bits' team that entered into the AI Cyber Challenge. Uh, we built the tool called Buttercup, which took second place overall in the AIxCC. And, um, yeah, that's about it.

S1

Yeah. That's perfect. And that's exactly what I'd like to chat about. Um, so I guess, um, I guess the thing I'm most interested in is, uh, just the design of the system, and, um, I guess overall, what you know about the designs of the other systems. So design versus design, system versus system. Whatever you want to share or can share. Like, what are your thoughts on that? Um, I guess everyone releases open source. So maybe you've had a chance to look at some of

the other offerings. Maybe you've heard them talking, maybe you know the teams. Uh, so I guess what kind of intel do you have on what everyone else was doing versus what you guys were doing? And how do you think that went?

S2

Yeah. Well, um, yeah, I guess I can answer that last part pretty easily. It went pretty well for us. Um, so we took second place. Uh, the team that finished in first, Team Atlanta, um, they had a pretty similar setup to ours. Um, they had more components, more moving parts, uh,

more pieces. They had more hands. Um, larger team to be able to kind of implement more, um, but ultimately they had a really similar kind of set of design principles, um, that worked out for us. The third-place finishing team, Theori, they, um, had a bit of a deviation in terms of, like, their conceptual, uh, principles that guided how they built their system. But I can get into that in a bit. Um, I guess I can first start off by talking a

little bit about our concept. So it's interesting. Um, you know, the concept for Buttercup changed quite a bit over the course of the AI Cyber Challenge. So this got announced, um, a couple years back, and there was a period of about 4 or 5 months, um, after the cyber challenge was announced, but before DARPA had really released any rules. So we didn't really know exactly how the competition was going to be structured.

We just knew that we would have to build a fully autonomous, AI driven system that could find and patch vulnerabilities, um, with a high degree of accuracy. Um, so originally, the concept that I drew up along with my co-creator Ian Smith, um, was originally really ambitious. Lots of moving parts, lots of static analysis, dynamic analysis, lots of, um, conventional techniques, lots of AIML based techniques. But ultimately, once the rules came out,

it kind of got pared down quite a bit. Um, some of the things that we wanted to do, um, were marked as, like, out of scope. Some of the stuff we wanted to do was marked as against the rules, um, just for the tractability of the competition.

S1

So is that because they would have been too expensive? Didn't you have budgets you had to stay under?

S2

Yeah. So some of it was definitely, um, budgetary and some stuff was just, you know, flat out against the rules. We looked at fine tuning a large language model, um, with information about lots of open source software. And, um, there ended up being a rule about pre-baking models. So, okay, really, kudos to DARPA for making sure that, you know, competitors didn't have the ability to kind of, um, skew the systems that they build for the test, which is, you know,

finding and patching vulnerabilities in open source software. Um, so, yeah, there was a lot of stuff that got cut down. But ultimately the design of our system, um, was basically a pipeline. We kind of broke the problem down. We realized we had to do basically 4 or 5 things really well to win this competition. We had to be able to find vulnerabilities. And not only that,

we had to be able to prove they exist. So it wasn't enough just to, you know, use a static analysis scanner and say, hey, this thing thinks there's a vulnerability on line 50 of, you know, whatever. Uh, you actually had to have a crashing test case for the first round of the competition, the semifinals. And in the finals, they relaxed this requirement. But the pathway

to getting lots of points basically still required one. Um, so you had to find vulnerabilities and also prove they exist with a crashing input, or an input that would trigger a sanitizer in the target function. Um, you had to be able to contextualize and draw additional information about this vulnerability. Otherwise, patching was doomed to fail. Um, and then you had to actually patch the vulnerability. Um, so this is a highly complex, uh, problem that conventional

approaches to software analysis have really kind of not addressed well, in my opinion. And it was a great area to use AI. And then we also, you know, finally we had to orchestrate all of these functions and do really high quality engineering around all of them so that the

system would stay up and running for several days. Um, so based on those kind of 4 or 5, depending on how you chop them up, core principles or core tasks that we had to do, um, we kind of decided on an approach that we kind of call the best of both worlds, which was, you know, we knew that conventional software analysis, whether it's dynamic, static, hybrid, whatever, um, it really excels at certain subproblems within this pipeline, and

it really struggles with other ones. And AI/ML, and specifically generative AI, which the competition was kind of heavily skewed towards. Generative AI does really well at certain types of subproblems in this pipeline, but also really struggles with others. So our approach is pretty straightforward. We're going to merge the best in class capability for each

part of this pipeline. Uh, stitch them together with high uptime, high reliability engineering code, um, and then focus on doing really, really well for the largest number of, um, the largest number of possible targets that we could possibly, um, that we could possibly do well in.
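
A minimal sketch of the pipeline shape described above, assuming nothing about Buttercup's actual code. The `Finding` class, `run_pipeline`, and the stage callables are hypothetical names chosen for illustration. The only point is that plain deterministic code decides which stage runs next, and each stage can be swapped out independently.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Finding:
    """One candidate vulnerability as it moves through the pipeline."""
    target: str
    crashing_input: Optional[bytes] = None  # proof that the bug is real
    context: dict = field(default_factory=dict)
    patch: Optional[str] = None


def run_pipeline(
    target: str,
    find: Callable[[str], Optional[bytes]],
    contextualize: Callable[[Finding], dict],
    patch: Callable[[Finding], str],
) -> Finding:
    """Orchestrate the stages in a fixed order; no model picks the next step."""
    finding = Finding(target=target)
    finding.crashing_input = find(target)
    if finding.crashing_input is None:
        return finding  # nothing proven, so nothing to patch
    finding.context = contextualize(finding)
    finding.patch = patch(finding)
    return finding


# Stand-in stages (hypothetical): a real system would plug in a fuzzer,
# program analysis, and an LLM-based patcher here.
result = run_pipeline(
    "libexample",
    find=lambda target: b"\xff" * 8,
    contextualize=lambda f: {"function": "parse_header"},
    patch=lambda f: "add bounds check in " + f.context["function"],
)
```

Because each stage is just a callable, a subteam can replace one of them without touching the others, which is the plug-and-play property discussed in the conversation.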

S1

Okay. Yeah. Interesting. So would you say that, um. Basically those those things that you described in the beginning, those are like modules and they should almost like, kind of work independently. So you can, like, hand a task to each of them. Is that kind of the the system design idea?

S2

Yeah. Yeah. So we, um, part of this was just surviving a really rapid development cycle. This wasn't really advertised all that well, but we actually only had about three months to develop the first version of Buttercup in the semi-finals. Um, and we actually only had about six months to develop, um, the final version of Buttercup, or Buttercup 2.0, which

took second place in the finals. Um, and that was because even though each round of the competition ran for a year, it took DARPA a while to solicit feedback from competitors, other stakeholders, and actually solidify the rules. Um, and until the rules were solidified, it was really a risk to do really kind of any development on the system. Also, certain things like the technical specifics on their competition

API weren't available until later in the, in these cycles. Um, so part of the reason why we modularized each component was so that we could take smaller subteams within my larger team of about ten engineers, um, all working some degree of part time on this system so we can modularize it, keep them kind of separate. You know, it gives us this integration problem that we have to deal with at the end. We have to kind of put everything together and make sure that it runs well. Um,

but it was kind of a necessity, because we had to work on developing everything independently. We couldn't afford to just do the first block and end up with it becoming like, you know, that meme of the horse drawing where there's a really finely defined head, and then as it gets towards, like, the back parts of the animal, it turns into, like, a rough sketch. That was what was going to happen if

we didn't modularize this. Um, but it also helped because as we decided to change out strategies or play with different strategies, it made it really easy to kind of plug and play different parts to see what would work later on.

S1

Yeah, that makes sense. So I keep having this debate with a whole bunch of people. It's kind of around, um, let the model do the work because the model is smarter. Um, and it just understands what to do. And then there's uh, the other argument, which is, um, build a robust system and you have the model kind of just be the intelligence that helps guide the system or moves things through the system, or maybe routes, uh, across the system or whatever.

But the system itself should be set up really well, and you're kind of like functioning as a router. And then when the model gets updated, it makes the system better. Um, but the counter to that is basically that we're just going to design bad systems. So we should stop trying to be rigid there and just use the model. Like where do you guys fall on that?

S2

Uh, I think it was probably closest to the second one, and maybe more like an undescribed third thing. So I'll kind of go over it. Um, you know, we've been, and me in particular, I've been doing research on, like, applied AI for security problems since before, uh, the large language model became the predominant form of technology, back to, you know, the 2018, 2019 time frame. Um, and, uh, realistically, like, large language models are great at a good number

of things. Um, but they really struggle with certain things. And particularly in a challenge like this where you have to do multiple things right in sequence in order to be successful, you have to worry about errors that start off in early stages of an LLM heavy pipeline that compound over time, until eventually you get to the point

where the whole thing kind of collapses. Um, so our philosophy on using AI, uh, specifically within the AI Cyber Challenge and also kind of more broadly, um, is to use it for, um, tightly constrained, highly contextualized problems where, um, the models are set up for success. Um, so this is actually kind of an interesting anecdote. Um, during the first round of the AI Cyber Challenge, um, the whole concept of, like, multi-agent systems,

systems that have, like, tools available to them, um, didn't really exist. It was, like, in a couple of papers on arXiv. And ultimately, um, the way we built our patcher for the semi-finals and for the finals, um, is now reflective of how LLM-driven systems are just built today. So it's actually really vindicating. So, like, our

patcher is, like, a multi-agent system. It's got multiple large language models, each with different roles to play within this process, that collaborate to generate a patch and then validate it to make sure that it, one, will compile; two, will actually fix the vulnerability that we've discovered; and three, doesn't break other functionality within the program. So we found that trying to ask one large language model to

do all of that didn't really work out. And also in the semi-finals, the reasoning models, um, or the thinking models, depending on the branding, they didn't exist, they weren't available. They weren't even available to us to use as, like, um, early adopter models in the AIxCC. So we were dealing with simple, you know, back

and forth, um, style chat models. Um, so we actually had to build in a lot of this reasoning as part of this, like multi-agent architecture, we had to build in a lot of like reliability and engineering code around maintaining the pipeline. Um, fortunately, the process for um, discovering

artifacts and submitting them was pretty rigid. Um, so it didn't really affect us that much, in that we didn't have to, like, put a lot of really complex reasoning in. Um, but actually, even by the end of the finals, we didn't use a reasoning or a thinking model, um, in Buttercup, because we'd actually built it in. It was part of the circuitry, or part of, like, the, um, the Python code, part

of our orchestration code. Um, so we had the opportunity in the finals to take that out and let the model do the work. We kind of explored it a little bit, but ultimately we decided against it because the best case scenario was that the model would kind of figure out on its own how to break the problem down and how to do individual things, and what tools to call in sequence. Uh, but we were already subject matter experts who did it exactly the way it should

be done. So the best case scenario is that the model was able to replicate what we've done, only at a more expensive per-call cost, um, or a more expensive, like, volume of tokens. Um, so we actually kept, um, we did upgrade our models. We went from the GPT-3 series, um, and the Claude 3, uh, series of models and moved up to, um, the four and, like, basically the Gen 4 versions of models for the final.

So we, we upgraded the underlying models, but we very much, um, kept the problems very small for the, for the AI's or for the, um, for the AI models, so that we would avoid this issue where you have compounding errors, you have to worry about like these, these modulo errors of, you know, deciding to do the wrong thing in sequence.

And that actually turns out to, uh, penalize you heavily in these long systems, because, you know, when a system decides, you know, hey, I've got to do A, B, C and D, and it does C before B, all of that information involved with dealing with this, like, out-of-sequence task stays in the context window. And it kind of, for lack of a better term, kind of pollutes the model's ability to kind of reorder

those tasks and do them correctly. It has a hard time kind of forgetting information until it rolls out of the context window. So it's a really long way to say we probably did the latter version. But, um, one thing I do want to say is like the actual like processing of artifacts through the system, we didn't rely on the AI to kind of figure out, okay, I've got a vulnerability now I should patch it. That was also all, um, that was also all orchestrated, um, by our by our larger pipeline.
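
The validation step described above (a patch must compile, must fix the proven crash, and must not break existing behavior) can be sketched as a deterministic gate around a patch proposer. This is an illustration under stated assumptions, not the Buttercup patcher: `first_failure`, `generate_validated_patch`, and the check names are hypothetical, and `propose` stands in for whatever model call produces a candidate. Only the name of the most recent failing check is passed back, which keeps the proposer's context small.

```python
from typing import Callable, List, Optional, Tuple

# A named check: returns True when the candidate patch passes.
Check = Tuple[str, Callable[[str], bool]]


def first_failure(patch: str, checks: List[Check]) -> Optional[str]:
    """Run checks in order and return the name of the first one that fails."""
    for name, check in checks:
        if not check(patch):
            return name
    return None


def generate_validated_patch(
    propose: Callable[[Optional[str]], str],
    checks: List[Check],
    max_attempts: int = 3,
) -> Optional[str]:
    """Ask for a patch, validate it in plain code, retry with short feedback."""
    feedback: Optional[str] = None
    for _ in range(max_attempts):
        patch = propose(feedback)
        failed = first_failure(patch, checks)
        if failed is None:
            return patch   # passed every check
        feedback = failed  # only the latest failure is carried forward
    return None            # give up rather than submit an unvalidated patch


# Stand-in checks (hypothetical): real ones would invoke a compiler,
# replay the crashing input, and run the project's test suite.
example_checks: List[Check] = [
    ("compiles", lambda p: True),
    ("crash_fixed", lambda p: "bounds check" in p),
    ("tests_pass", lambda p: True),
]
candidates = iter(["rename variable", "add bounds check"])
accepted = generate_validated_patch(lambda fb: next(candidates), example_checks)
```

The ordering of the checks and the retry limit live in ordinary code, so the model is never asked to decide whether its own patch is good.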

S1

Okay. Okay. So yeah, I've seen this a lot as well. I mean, I feel like this is a general concept that people are coming to, which is, um, I don't want to say legacy tech. Traditional tech is just like, deterministic. So, like, that's the tech that you want to use to, like, do things that matter, and then you kind of want to use like AI for like a, um, I don't know, like a router maybe, or like a, um, something intelligent about choosing which standard tech to use, but not making like, choices.

Maybe necessarily. Um, I don't know. I'm trying to figure out how to articulate that, but it's like.

S2

Yeah, well, it's actually funny you bring this up. I've had to kind of get good at articulating this, um, over the last couple of years. So the way I've explained this to people is that certain problems, particularly in computer science, but this kind of generalizes everywhere, certain problems lend themselves to prescriptive solutions. So a prescriptive solution is something that we do when we write an algorithm to solve a problem. This could be like coming up with an

answer for the traveling salesman problem. You know, we know it's a really difficult problem to solve, but there's greedy algorithms that do a pretty good job and for the most part, will get you a good answer. Maybe not the best answer, but they'll get you a good one. So for these types of problems, you can prescribe a set of steps to the computer and let them execute them. Now other problems are really, really challenging to prescribe a

solution for. So these types of problems lend themselves to AI or ML techniques because you can use a descriptive instead of prescriptive solution. So a good example of this is like image recognition. So it's really really hard to take a picture of a cat and write a computer program that will say, okay, based on the pixel colors of this pixel and this position, this is going to be a cat, because a cat can be in a million different contortions. It can have different hair, the face

can be half obscured. But what we can do is we can describe to an AI ML model what a cat looks like with millions of pictures, because we have millions of pictures of cats. And then it can do a good job of solving that problem. Now it might make mistakes, but this is better than the option that you had with the traditional approach, because that approach was awful to begin with. So a good example of a

corollary for this in Buttercup is patch generation. There's a lot of synthetic code generation tools and a lot of research in this area. But in terms of like automatically generating patches to fix bugs, unless your bug is like dead obvious, like it's missing a bounds check and it's really easy to apply some sort of pattern matching to figure out what the lower bound is, or the upper bound is that needs to be checked. Um, tools to generate patches for weird bugs. Like they just don't exist.

So this is a great place for AI/ML to help us out. And it actually turns out, um, you know, this is really proven true by the AI Cyber Challenge and by Buttercup, more specifically, um, LLMs are great at generating code, um, because it's one of the biggest value propositions right now for the technology. So, um, generating patches for bugs is tightly constrained. I'm not asking it to generate all of the code that is necessary to build this entire system that I've got a spec sheet for.

I'm only asking it, given this code, and given what we know about this vulnerability, how would you change it to fix it? The large language models have already internalized large numbers of incremental commits to open source code repositories that fix bugs, so they actually have a really good track record with, um, more than I expected, even

when we started this, uh, with generating patches. So this is a great example of where generating a patch is something that lends itself towards a descriptive solution and a descriptive algorithm, uh, or an AI/ML algorithm, versus something that's prescriptive, um, which is fuzzing. Fuzzing is a good example of a prescriptive solution. If you need to find a vulnerability and you need a crashing input, um, you have

to be able to prove that it exists. It's really, really hard to get an LLM to do that because LLMs, in the underlying reasoning, they don't have, like, data feedforward. Um, basically, they look at source code like they look at natural language. Natural language doesn't describe the activities of an underlying state machine that runs on hardware after it passes through a compiler. So, like, you know, the source code when looked at by a model. Models look at

source code in a really shallow way. Um, so when we want to find, you know, a crashing input, a fuzzer is a great way because we can prescribe a solution, which is try everything, brute force it. Um, just come up with different inputs, throw it in there, and then if it crashes, well, there you go. You've proven it. So that's why we use fuzzing heavily early on, you know, for one type of problem, and we use patching heavily for another.
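
The "try everything, and if it crashes you've proven it" idea can be sketched as a toy random fuzzer. This is a deliberately naive illustration, assuming a Python callable as the target; production fuzzers are coverage-guided and far more sophisticated. The `fuzz` function and the buggy `toy_parser` target are hypothetical.

```python
import random
from typing import Callable, Optional


def fuzz(
    target: Callable[[bytes], object],
    max_len: int = 8,
    tries: int = 100_000,
    seed: int = 0,
) -> Optional[bytes]:
    """Throw random inputs at the target and return the first one that crashes it."""
    rng = random.Random(seed)
    for _ in range(tries):
        length = rng.randrange(1, max_len + 1)
        data = bytes(rng.randrange(256) for _ in range(length))
        try:
            target(data)
        except Exception:
            return data  # a concrete crashing input is the proof
    return None


def toy_parser(data: bytes) -> int:
    """Hypothetical buggy target: trusts a length byte without a bounds check."""
    declared_len = data[0]
    payload = data[1:]
    if declared_len == 0:
        return 0
    return payload[declared_len - 1]  # IndexError when declared_len > len(payload)


crash = fuzz(toy_parser)
```

The returned bytes can be replayed against the target at any time, which is what makes the finding ground truth rather than a model's opinion.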

S1

Yeah, that makes sense. And the other problem with, um, finding vulns with, um, AI, it also seems to me, is that, um, they want to please. They're heavily biased to be like, this is it. This is one. Yeah, this is definitely a hit or whatever. And you look at it and it's actually not. So I guess the intelligence is deciding to use the fuzzer, which it could help make that decision that a fuzzer should be used. Right?

S2

Yeah. Yeah. So it's it's funny you bring that up. Large language models really struggle to solve problems that aren't rooted in some kind of ground truth. Um, it turns out there's a huge difference there. We have some internal research that we haven't published. Anybody could reproduce it. But, um, so it turns out if you if you have a bit of source code and you ask the model to tell you where the vulnerability is, um, it will absolutely

hallucinate a vulnerability because it wants to please you. Uh, we have one of our researchers, um, one of our principal researchers, Artem. He's a great guy. He, um, he downloaded the, um, formally, uh, the formally proven correct portions of, uh, of Linux and asked a large language model several hundred times, um, here's a snippet of code, it has a vulnerability, where is it? And every single time it would manufacture a vulnerability because it

wants to find the answer. So it turns out when we started asking it, is there a vulnerability? Um, it messed up a little less, but it would still assume that because you're asking that there's something to find and it would still mess up quite a bit. So that's why, in the context where we're using, um,

large language models for generating patches. It's great because we know there's a vulnerability because we found it and we proved it, and we can collect additional information.

S1

Yeah.

S2

So now I don't have to worry about asking the model, hey, do you think there's a vulnerability? And if so, patch it. I say, no, there is a vulnerability. It's here. This is extra information about the code that touches it. Now generate a patch. And the model is very good at doing that because it takes away the decision making, or the judgment call, that large language models are really, really bad at because they don't actually model judgment calls underneath.

In their architecture, they model, you know, sequencing information, sequencing tokens. And when you write code, you're writing a sequence of tokens. So these problems tend to be, um, a lot more suitable than other problems where you're asking it to find the ground truth for you. Asking it to find ground truth: bad problems for LLMs. Asking it to take ground truth and expand upon it: great applications for LLMs.
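
One way to picture the difference in framing is a prompt builder that states the proven vulnerability as fact and never asks the model whether one exists. This is a sketch only: `build_patch_prompt`, its parameters, and the sample inputs are hypothetical, and the wording is not taken from Buttercup.

```python
def build_patch_prompt(function_name: str, source: str, sanitizer_report: str) -> str:
    """Frame patching around ground truth: the bug is stated, never asked about."""
    return "\n\n".join([
        f"The function `{function_name}` contains a confirmed vulnerability.",
        "Sanitizer report from a reproducing input:\n" + sanitizer_report,
        "Source of the function:\n" + source,
        "Rewrite the function so the reproducing input no longer triggers the "
        "sanitizer. Change as little as possible and return only the new code.",
    ])


# Illustrative inputs (made up for this example).
prompt = build_patch_prompt(
    "parse_header",
    "int parse_header(const char *buf, size_t n) { /* ... */ }",
    "heap-buffer-overflow READ of size 1",
)
```

Everything the model sees is something the deterministic stages already established, so its job is to extend ground truth rather than to discover it.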

S1

Oh man, I love that. And this also goes to your previous point of not wanting to pollute the context for the current task on hand, which is building that patch, because if you have like some history of like there were previous decisions made or previous questions asked or whatever it might get like diverted, you know?

S2

Yeah, absolutely. It's um, it's a, it's a big challenge particularly, um, I don't know, it's funny. I've, I've been kind of trying to sing this gospel internally, uh, at Trail of Bits and to other people who will listen that, um, the increasing size of context window is not always your friend. Um,

by increasing the size of the context window. I mean, if you think about how the large language model works under the hood, it's using these contexts to attune the model to certain parts of its training data that are going to be highly relevant to solving your particular problem. And the more words and the more tokens you put into the context window, the more you are kind of nulling out or, um, numbing the attention mechanism. You're forcing it to become more and more general, because now there

are more tokens that are affecting these attuned probabilities. So you actually are better off with using fewer. Now, a large context window is great, because if you need, let's say, a million tokens in your context window to constrain the problem, then use a million tokens. But if you can do it with 1,000 or 10,000, you're going to get better results because you're more likely to focus that model where it needs to be.

S1

Yeah, I love this. Like, by the way, this is great. Um, I'm going to create a lot of content out of this, um, because it's really crystallizing, like, starting to form something in my mind. I'd love to work with you on it. Um, essentially, what I'm trying to think of is, um, what are some general statements that we could make? Um, one that I'm sort of heading in the direction of, you tell

me if I'm wrong, is like, and this might be overstating it, but like, the system itself should be highly modular and, as much as possible, made up of traditional and deterministic tech. And then the way that you use the AI is for the specific type of problem, which we're going to articulate the way you articulated it, for those types of problems where routing is needed to the traditional tech. Um, and it's like, don't just go crazy with AI. Don't ask it questions that the traditional

tech should be answering. Um, it's something like that. And then ultimately you have like this dependable deterministic system with the minimum amount of AI that is required to move appropriately through that system.

S2

Yeah. So yeah, really it comes down to problem formulation. And this is like the the great part about and this is part of the reason why you see such a huge overlap in interest between people from the computer science background and people from like data science backgrounds on here because, you know, one of the basic things you learn in computer science, like when you get to like

the graduate level is problem formulation. It's how to recognize your problem as a derivative, or maybe a like dressed up version of some other problem. So, you know, right away, um, okay, I have this problem of, okay, I've got to manage this delivery system. How do I make this delivery system, um, for Amazon efficient? You can recognize this right away as, oh, this is traveling salesman. There's no good way to do this. But what I can do is I can. I'm going

to get a good answer. I just have to accept that my answer is going to be imprecise or not necessarily optimal. Um, and in applying AI and ML to security problems or any problem in general, the first step

is very much like problem formulation. It's understanding what kind of model is going to work best for this problem. Is this a problem that will work well with a time series model, because my data is coming in over time? Or is this a problem that's going to work well with, um, let's say, like, linear regression, because there is some true underlying probability for how the data is distributed that I'm trying to learn from? One of, like, the kind of curses of large language models

is that they have abstracted all of these good data science practices away. And now it's great because it democratizes it. Anybody can use AI, anybody can use an LLM. And all you have to do is be able to articulate your problem. The problem is that it also abstracts away problem formulation. And now we're starting to use LLMs, because they're accessible, for certain types of problems that they're really not well formulated for. Um.

S1

Yeah.

S2

So this is this is kind of where we get to the issue. So the good news is we don't have to just like say, okay, well, I can't do problem formulation with an LLM, so I just throw it away. Don't use it. I have to go back to, you know, TensorFlow and writing my own models and stuff. What we really have to do is get to what you were describing, which is rather than throw the LLM at a large problem. We take it a step further. We break the problem down.

Are there subproblems that are highly amenable to AI solutions? I have a litmus test that I pass, um, you know, problems through, and I try to encourage my team members to use, um, which is, you know, basically like a check to see whether a problem is good for AI/ML. And it's usually, you know, do you have enough data that you can train the model on? In

this case, it now becomes: does the LLM have examples of this on the internet that it can draw from, or are you asking it to do something like reverse engineering, you know, firmware code on this obscure chipset that, like, there's no examples of on the internet? Bad example. It won't have anything to draw from. Number two, um, is there some

probabilistic nature to the data that's underlying? This actually makes large language models really bad for a lot of security problems, because those problems are what we call non-differentiable, meaning that they don't have, like, this nice curved space that you can use stochastic gradient descent or virtually any optimization function to try and climb and find a good answer. It actually exists more as, like, this kind of cloud

with dots of answers all over the place. If you were to try and imagine the answers to security questions in like a mathematical graph.

S1

Okay, what's an example of one of those? I'm trying to think of what that space might look like.

S2

Yeah. So a good example of like a problem that is differentiable is like housing prices. So housing prices vary by, you know, like the size by square footage. Yeah. Square footage, number of rooms, zip code quality of the schools. So when you plot these all out you get something that you can do linear regression on. You can see like.

S1

A.

S2

Little loop. And that's called a differentiable function because it's a continuous line that you can draw through the data that more or less minimizes the error of those points along the line.
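
For the housing-price example, a minimal sketch of fitting a line by ordinary least squares on a single feature. The `fit_line` helper and the numbers are made up for illustration (the prices were generated from a known line so the fit is easy to check by hand); a real model would use several features such as rooms, zip code, and school quality.

```python
from typing import List, Tuple


def fit_line(xs: List[float], ys: List[float]) -> Tuple[float, float]:
    """Return (slope, intercept) minimizing squared error for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Illustrative data: price = 100 * square_feet + 50_000.
square_feet = [1000.0, 1500.0, 2000.0]
prices = [150_000.0, 200_000.0, 250_000.0]
slope, intercept = fit_line(square_feet, prices)
```

This works because every house shares the same measurable dimension, so the error changes smoothly as the line moves. That shared, smooth structure is what the conversation argues security findings lack.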

S1

Yep.

S2

But if we want to think about, um, let's say now optimizing a program, we can take a look at how ordering certain steps or changing the way we implement certain functions as changing the speed of a program up

and down, and that becomes kind of pseudo differentiable. It's it's more like a step function where you have kind of like little lines where if I change this one thing, it jumps up a little bit, it's more jagged, but there's still, um, it's close to differentiable because I can kind of map deterministically how if I run it on, you know, with this set of compiler optimizations or that

it's definitely not differentiable, but it's closer. Security is just wild because the flaws in computer programs can come from one of a million different sources. It can be a logic bug, it can be a misimplemented function. It can be the use of an unsafe function, which is easy to find. There's no way for us to take, um, root causes for vulnerabilities in software and solutions to them and plot them on a graph. Because they come from

unquantifiable sources. Some of them, like, you know, Spectre and Meltdown and stuff, they're resident in hardware and the implementation there. Some are purely in software, like XSS-type vulnerabilities. It's, um, it's not even apples and oranges. It's like trying to compare apples and fighter jets. Um.

S1

Is it, is it a matter of, like the, the tensor size or the, um, I think that's called tensor size. I can't remember the, the, um, the number of dimensions in the space, because when you're looking at square footage and price, you have two, right? Is the problem in security that there are just so many dimensions that, um, when you try to plot it, you try to simplify it, it just becomes garbage?

S2

Well, it's a matter of common dimensions. So if you build a house, every house has square footage.

S1

There you go.

S2

And you can calculate the space underneath. But a cross site request forgery vulnerability in a, um, you know, piece of JavaScript code that exists on the web has almost nothing in common with a memory corruption vulnerability in a C program running on a router in your home device.

They are implemented at different levels of abstraction. You know, like even the program representations are different because some of the vulnerabilities might exist only in binary code after it's been compiled versus other vulnerabilities that are resident in source code that's interpreted via web browser. Um, so really what it is, is it's like trying to plot, you know, the prices of homes, along with the prices of, um, I don't know, oranges in a

particular year. You know, there's very little in common between a house and an orange other than maybe some, like, you know, global macro effects that might show some correlation, you know. You know, economic factors like inflation.

S1

Or like the beating of a whale's heart to determine whether or not it's healthy. It's it's like completely different. Uh, yeah. Completely different sports. Yeah. Yeah, yeah.

S2

Yeah. So, so really, it's a it's a lack of common dimensions in cybersecurity, which is why, you know, if we think about like if we were trying to model, like what the data would look like, if we could visualize it, it would just be a bunch of points of presence out there. Um, uh, within this, like, kind

of large cloud. Um, and even then, that's another problem that kind of makes cybersecurity really hard to model with AIML is that there is really comparatively little data, um, in terms of like the volume of data, there's tons

of vulnerabilities out there. But if you're trying to make a model that's really, really good at, let's say, detecting, um, buffer overflows in embedded device code, um, you're going to find some data for that, but there's not that much. You have to rely on, like, POC write-ups on the internet from practitioners who put it out there

for fun. Um, but there's not a million examples of that like there is if you want to say, I want to train a model to write the Great American novel. There you can take every novel ever written, throw it in there and then see what the model comes up with. If you prompt it with like a general plot line, it's going to do a lot better at that because, you know, that data fills in that space a lot more. Um, so so, yeah, it's, um. Yeah.

Like the, the, the challenges and problem formulation are, are really big and, um, yeah, that's why I kind of encourage people when they look at these like, okay, I want to build an AI, ML driven system. Um, take a look at what subproblems are actually suitable for AIML. Use them there. And I think you'll also find that a lot of the times we have a tendency to like say, okay, let's just kind of throw large language models at some of these problems that we know we

could really solve with regular code. Um, and that's really bad because of this compounding error problem. So, you know, if I have, you know, five steps in sequence that I've got to do, and step three is good for AIML and step four is good for AIML, you know, like it's like, okay, well, look, almost half of this problem is, you know, is something I'm going to ask the model to do anyway. I'll just ask it to do one, two and five too. Well, the problem is it can make a mistake in one. It can make a mistake

in two. Those compound before you get to three and four. So you're better off, you know, implementing one and two in code. And then maybe you ask the model just to finish it off and do step five because it's the final step. It's had ground truth rooted in steps one and two. Steps three and four, if they're well contextualized problems, maybe the false positive rate is low enough that you can afford to just let the model kind of finish it up for you. But that's the biggest that's the biggest jump

I would take. Usually that step five is like validation or correctness, um, checking. And that's not something you want to ask the model to do because it has the tendency to, um, one, be wanting to kind of like please itself and say, oh yeah, it looks great to me. Um, or two, um, depending on how you phrase it, find something that doesn't exist. And validation is a problem that typically is, uh, is pretty amenable to like deterministic code.

S1

So I really love this. Um. Where this is taking me is designing, like, a, uh, a general problem solver. And I'm imagining, like, the smartest model that you have. You know, Opus, whatever. Or, like, the best Gemini or whatever or whatever the best model is. But but then what you do is you say, okay, uh, the problem is we need to design a system that, uh, you know, properly, deterministically solves this problem with a high level of accuracy.

For example, the vulnerability problem that you guys worked on. And then what I love is the idea of you present to the model all these different AI models and all these different deterministic technologies, all as solutions. And then you do what you said, which is you, um, break down the problems that need to be solved at every

level of the subpieces. Right. And then you match each of those little problems to either one or, uh, one or many of these AIs, which are bigger or smaller, have different weaknesses or whatever, or even ML, not even LLM based. Yeah. Versus deterministic with the rule of like look, use the appropriate one for this problem type. And then maybe you have a whole bunch of training about problem types and solution types. And then it picks which one to use for each step. I mean is that.

S2

You mentioned this. I think this is what some of like the large, you know, third party ML as a service providers like OpenAI and Anthropic are kind of trying to do. If you've heard of like this concept of like mixture of experts models, um, it's uh.

S1

That's true.

S2

Yeah. It's this concept where, you know, like, you know, like the, the actual interface. We have to maybe GPT-5 and, and I haven't looked at the source code.

I don't work at OpenAI, so I have no idea if this works underneath the hood, but it's been kind of theorized and it's even been mentioned, you know, a bit in terms of, um, you know, people who've kind of looked at the models a little bit closer that, you know, um, you know, when we, when we, we fine tune a model to make it really good or really suitable for a particular purpose that's amenable to AIML, it can still be challenging to, um, have it interface

with the user in the way that like a high quality chatbot would. So using, yeah, a mixture of experts model suggests like having like an interface, like a bot that interacts with the user but then recognizes certain classes of problems and directs them to the right expert. So, oh, they're asking me about cyber. I'll ask, you know, um, cyber GPT to handle this one. Oh, they're asking about, you know, mental health, I'll ask, you know, mental health

GPT to to help out here. Um, so, you know, this kind of like concept I think is I think it's trying to be creative, or at least it's been thought of, um, in terms of using like all AI, ML solutions. But but yeah, I agree, like the way forward is to have, um, you know, for, for like rapid like prototype development have like components that do certain things. Well, um, and honestly, it's like reflected in software, like we have libraries for, we have libraries for sorting. No one or

we have libraries for cryptography. Nobody should be writing their own cryptography code. Use a library. Um, you know, the closer these high quality libraries and, um, fine tuned ML applications or ML models for certain types of subproblems, the closer we get to being able to kind of compose all these together. And the good thing is, is that an LLM is probably pretty good at writing the glue code to sequence all this stuff together.

S1

Yeah, yeah. Because because that's the trick for me. Because inside of a mixture of experts, you're already inside the LLM. What I'm thinking of this higher level model is like, look, we're doing it. We're doing, um, matrix math over here. We're doing multiplication over here. Um, guess what? This problem space is not associated with an AI. We don't even

know AI will ever touch this. We hand it to our fastest and best, you know, deterministic addition function or whatever, you know, and it's like maybe 95% of the whole app ends up being traditional tech that doesn't involve AI, other than the routing to get there.
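A minimal sketch of that routing idea, under stated assumptions: the task kinds, the handler table, and call_model are all hypothetical names for illustration, and call_model is a stand-in stub rather than any real provider's API. It is not how Buttercup or any vendor system is built.

```python
from typing import Any, Callable, Dict, List


def add_numbers(values: List[float]) -> float:
    """Deterministic addition: no model involved."""
    return sum(values)


def sort_numbers(values: List[float]) -> List[float]:
    """Deterministic sorting via the standard library."""
    return sorted(values)


def call_model(prompt: str) -> str:
    # Hypothetical placeholder for a language-model call.
    # A real system would call a model here and validate the result.
    return f"[model output for: {prompt}]"


# Task kinds that ordinary code solves exactly.
DETERMINISTIC_HANDLERS: Dict[str, Callable[[Any], Any]] = {
    "add": add_numbers,
    "sort": sort_numbers,
}


def route(kind: str, payload: Any) -> Any:
    """Prefer a deterministic handler; fall back to the model stub."""
    handler = DETERMINISTIC_HANDLERS.get(kind)
    if handler is not None:
        return handler(payload)
    return call_model(f"{kind}: {payload}")


print(route("add", [2, 3]))
print(route("sort", [3, 1, 2]))
print(route("summarize", "a long report"))
```

In this sketch the lookup itself is deterministic; the version described in the conversation would have a model decide the routing, which adds one more probabilistic step to account for.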

S2

Yeah, I mean, that would be ideal. I mean, anything you can route, anything. Anything you can. Yeah, I don't know. It's funny. It's like really what it comes down to is like using large language models and like, solving large problems. It becomes a conditional probability problem. And even if you get the answer right 99% of the time, um, over and over and over again, you still have a high likelihood of failure by the

time you compute all the conditional probability out. It's kind of funny. Like, I kind of learned this lesson in like, in a completely different walk of life. Um, after I got my bachelor's degree in CS, I, I worked for like a year doing, um, software engineering and kind of found it to be dull, so I, I, I did something completely different. I joined the Army and I started flying helicopters. Um, it's actually nice. That is, that's actually, you know, I'm up at Camp Dwyer in

RC Southwest in Afghanistan. This, um, picture was taken of our aircraft on the flight line, and one of my jobs as a pilot was to educate our junior pilots on this concept of, like, mission survivability. Um, and that's the idea of, um, you know, understanding what's called, like, the kill chain. The kill chain has been pretty popularized

in security as well. But, you know, basically for a compromise, whether it's shooting down an aircraft or breaching a database, like a lot of things have to happen and they all have some sort of probability. And your goal in breaking the kill chain or breaking the exploitation chain is to reduce any one probability down to zero, because then the common or the conditional probability problem becomes zero. Um,

but the probabilities can be really weird. I used to talk to my junior pilots and ask them like, hey, what do you think is like the acceptable loss rate on any of the missions that we fly here in theater? And they would usually give me answers like they were pretty close. They'd say like 90% or 95% or even 99%. So I would actually take them through the math problem. I'd get out the whiteboard and I'd say, okay, let's assume it's 99%. I say, okay, how many aircraft are

we flying a day? Okay. You know, we have ten total aircraft. We go on five missions a day. So that's five aircraft are going out there. And let's say there's only a 1% chance that each one of them gets shot down. Okay. So that's five aircraft a day. But we're going to be in theater for nine months. We'll round it off. We'll make it a year. We're going to be here for 365 days. So now if I take 365 and multiply it by five, that's the number of

missions we're flying in the entire time we're here. This number comes out to be pretty high. And now all of a sudden, if I lose one aircraft for every 100, you realize that I actually run out of aircraft in the first two months of being in theater. And now all of a sudden, the troops don't have, don't have helicopters to fly.
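The whiteboard arithmetic can be written out as a short calculation. This is a back-of-the-envelope sketch using the numbers quoted above (ten aircraft, five sorties a day, 365 days, 1% loss chance per sortie) plus two simplifying assumptions that are not from the conversation: sorties are independent, and the daily sortie count stays fixed as aircraft are lost.

```python
SORTIES_PER_DAY = 5
DAYS_IN_THEATER = 365
FLEET_SIZE = 10
LOSS_PROBABILITY = 0.01  # chance a single sortie ends in a lost aircraft

# Total sorties flown over the deployment.
total_sorties = SORTIES_PER_DAY * DAYS_IN_THEATER

# Expected number of aircraft lost if every sortie carries the same risk.
expected_losses = total_sorties * LOSS_PROBABILITY

# Expected number of sorties before the tenth loss (fleet_size / p),
# converted to days at the fixed daily sortie rate.
expected_sorties_to_lose_fleet = FLEET_SIZE / LOSS_PROBABILITY
expected_days_to_lose_fleet = expected_sorties_to_lose_fleet / SORTIES_PER_DAY

print(f"total sorties: {total_sorties}")
print(f"expected losses: {expected_losses:.2f} (fleet size {FLEET_SIZE})")
print(f"expected days until fleet is gone: {expected_days_to_lose_fleet:.0f}")
```

Under these assumptions the expected losses come to roughly 18 aircraft against a fleet of ten, so the fleet is exhausted well before the year is out. The exact timing depends on the assumptions chosen.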

S1

Yeah.

S2

So I said, actually, believe it or not, our our acceptable loss rate is something more like 99.99999%. Um, we can almost never lose an aircraft because. Or we can almost never accept any type of probability. That means we have even a remote chance of losing an aircraft because we will deplete them. It's a limited resource. Um, solving

problems with LLMs is the same way. If you ask them to solve 15 problems in a row, even if it's got a 99% chance, which is which would be amazing if any LLM could get anywhere close to that, even if it has a 99% chance of answering every single problem right over the course of a year, it's probably going to give you answers that are wrong almost 80% of the time if that chain is long enough. And if you have enough problems that you feed through it.

So that's one thing I try to like, um, help people conceptualize about relying on large language models and try to help them understand this, like compounding error problem. It's really a conditional probability, uh, compounding conditional probability problem. And your tolerance for false positives is actually zero. So anywhere in this chain that you can we have to think about this differently now because I can't reduce anything to zero.

But what I can do is I can take certain parts of the chain and I can bump them up to 100%, meaning my chances of getting something right when I use a deterministic algorithm are 100%. So now I no longer have some sort of fractional probability out of. So this 15 step problem now let's say 12 steps I do deterministically. Now I only have a three step chain. And now that 99% I'm getting it right only three times.

You simplify this problem. Now I might be able to make it through a year's worth of operations at, you know, 100 examples of the problem a day. I might be able to make it through that with a false positive rate of. I don't know what the math is in my head. I'd have to I have to punch it out. But that false positive rate might be a lot more survivable in an operational world than, you know, 15 conditional probability problems that are all 99%.
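The math left unpunched here is short enough to write out. This sketch assumes each model-driven step succeeds independently with the same probability, which is a simplification; the function names are hypothetical, and the 100 problems a day over 365 days come from the figures mentioned above.

```python
import math


def chain_success(p_step: float, steps: int) -> float:
    """Probability that every step in an independent chain succeeds."""
    return p_step ** steps


def steps_for_failure_rate(p_step: float, failure_rate: float) -> int:
    """Smallest chain length whose failure probability reaches failure_rate."""
    return math.ceil(math.log(1.0 - failure_rate) / math.log(p_step))


P_STEP = 0.99
PROBLEMS_PER_YEAR = 100 * 365

for steps in (15, 3):
    success = chain_success(P_STEP, steps)
    expected_failures = PROBLEMS_PER_YEAR * (1.0 - success)
    print(f"{steps:>2} model steps: success {success:.4f}, "
          f"about {expected_failures:.0f} wrong answers per year")

print("steps needed for an 80% failure rate:",
      steps_for_failure_rate(P_STEP, 0.80))
```

With these assumptions a 15-step chain fails about 14% of the time and a 3-step chain about 3% of the time, and the chain would need to be on the order of 160 steps long before failures approach 80%. Replacing twelve of the fifteen steps with deterministic code cuts the expected number of wrong answers per year by roughly a factor of five.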

S1

Yeah, yeah, I love that. The way I describe it is, um, what's 1% of 100 metric tons of problems.

S2

A metric.

S1

A metric ton of problems?

S2

Yeah, I love that. I love that.

S1

Yeah. Yeah. Um, so, uh, we share this in common, actually. So, um, I was, um, I was also Army, and I was at. I was at Fort Campbell, so I was air assault, so I had to do all the helicopter stuff.

S2

Uh, right on, man. Hell, yeah. Brother.

S1

Yeah. That's cool. Airborne air assault. Right? Um, yeah.

S2

No. Yeah, I, I was, um, uh, this this picture was taken when we were doing, uh, medevac chase. Uh, we we did security for those guys over there, but I was in an air assault battalion, so we literally did nothing but fly you guys around, so.

S1

Oh.

S2

Nice man. Small world. Dude.

S1

Yeah, yeah.

S2

Yeah, I was over at Fort Campbell. I, I was at, um. I was at Fort Riley, uh, in the 1st CAB and then, um, I PCS'd from there after I went to Afghanistan and went to the 82nd. Um, so I never got, never quite got to Campbell, which, like, would have been great because I live here in Ohio, in Cincinnati. It's like where I was from. So I was like always trying to get to Campbell because it was like only like 4 or 5 hours from home

and be able to see family a lot easier. But I ended up like 12 and nine hours away, respectively, so, uh.

S1

Yeah. Well, that's super cool. Yeah, well, we need to chat some more. Man. This is, like, really, really cool stuff. Um, what you guys did on the team is cool, but I'm even more excited just about the way you think about these things. Um, I'm. I'm, uh, happy that, um, the way you're thinking about it is similar to the way I'm thinking about it. You've taught me a lot just during this thing. We should we should definitely chat more after this. Um, anything else you want to

share about the the competition or, um, lessons learned? Um.

S2

So I think one of the things that that came out of the competition, um, was a lot of vindication. Sorry. I nudged mouse in it. Oh. So, um, I'll just I'll just go right into the answer. I assume you

can edit this later or something, but yeah. Um, so yeah, one of the things that, um, that came out of the competition was, was honestly a lot of vindication, um, like I had mentioned before, you know, when we started off this process, um, this was two years ago, which has been two lifetimes in the development of like AI

enabled systems for any problem, much less cybersecurity. Um, so a lot of the things that we did, like tool enabling, um, and multi-agent systems were things that we did before things like MCP or, um, complicated, um, libraries for supporting this existed, like we used early versions of, um, of LangChain, uh, for some of our multi-agent stuff, but we actually ended up having to write a lot of and implement a

lot of our own glue code for this. Um, so it's really vindicating to see, like, those techniques become, while we were doing the competition, not only, one, commonplace and, two, supported by the major large language model providers, but adopted and used generally by the community. Um, you know, it was really great that we came in second and that the first place finisher also used this like kind of, um, use, um, problem solving techniques that are

well suited for the problem approach. Yeah. Don't use AI everywhere. Um, the third place finisher, Theori, they were a little bit more LLM forward, but they still had a lot of, like, traditional components. I don't think any team really went after this, like, all-LLM, tried to just do everything within the LLM. Um.

S1

I bet a lot started that way, and they fell back from it. Yeah.

S2

Yeah. Yeah, I think I think at least one of them, um, at least one team, I think All You Need Is a Fuzzing Brain, I think in the semi-finals, their approach, um, tried to just use an LLM to augment a fuzzer to find vulnerabilities. And I don't think they really had much of, like, a solution for patching, but it was enough to get them to the finals. They had a more well rounded system, I believe, uh, in the,

in the finals. Um, so yeah, it was kind of vindicating to also see that all these other bright minds out there were also similarly of the, of the mindset to do this. But um, one of the biggest takeaways I have that I'll, that I'll say is that was like different than what I expected because it's really easy to pat myself on the back and say, oh yeah, all the plan I came up with worked great. That's

that's awesome. But, um, I will say that I was really surprised at how good large language models eventually became at helping us generate patches and also helping us generate seed inputs to improve fuzzer performance. Those were areas where I didn't really give the LLM a lot of credit up front, but I had to build an autonomous system, so I had no choice. They really outperformed my expectations. So I kind of came out of this with, um, a bit of a healthier respect for the capabilities of

AI models. Once again, these are still highly constrained.

S1

And yeah, yeah.

S2

Very context rich problems that we ask them to do, but they still did way better than I thought they were going to do. Um.

S1

Yeah. And also context constrained, not polluted, like a very controlled context for that thing. Like like you were talking about before, right?

S2

Yeah. Yeah. Um, yeah, I think that's about it. Unfortunately, I do have to jump off. I gotta I got another call at 1230, but, um. Yeah, I'd love to chat more and talk more with you at some point. If you want to do a follow up episode or, I don't know, you just want to chat about other stuff. Um, you know, we got a couple of friends in common between Clint and, uh, between Clint and Keith, and it's, uh,

you know, I've. I've run into you a couple places on various calls and stuff that we've been on, but, um, it was good to get a chance to talk with you one on one. I feel like we've been kind of, like, circling around in the same circle for a while, but I hadn't had a chance to, like, actually just chat the two of us.

S1

Yeah, absolutely. Well, thanks. Thanks for the, uh, the input. This is just, uh, fantastic stuff. And, uh, let's definitely catch up soon.

S2

Yeah. Sounds good man. Take care of yourself.

S1

All right. Take care.

Transcript source: Provided by creator in RSS feed