
Democratizing Generative AI Red Teams

Aug 02, 2024 | 45 min | Ep. 18

Episode description

In this episode of the AI + a16z podcast, a16z General Partner Anjney Midha speaks with PromptFoo founder and CEO Ian Webster about the importance of red-teaming for AI safety and security, and how bringing those capabilities to more organizations will lead to safer, more predictable generative AI applications. They also delve into lessons they learned about this during their time together as early large language model adopters at Discord, and why attempts to regulate AI should focus on applications and use cases rather than models themselves.

Here's an excerpt of Ian laying out his take on AI governance:

"The reason why I think that the future of AI safety is open source is that I think there's been a lot of high-level discussion about what AI safety is, and some of the existential threats, and all of these scenarios. But what I'm really hoping to do is focus the conversation on the here and now. Like, what are the harms and the safety and security issues that we see in the wild right now with AI? And the reality is that there's a very large set of practical security considerations that we should be thinking about. 

"And the reason why I think that open source is really important here is because you have the large AI labs, which have the resources to employ specialized red teams and start to find these problems, but there are only, let's say, five big AI labs that are doing this. And the rest of us are left in the dark. So I think that it's not acceptable to just have safety in the domain of the foundation model labs, because I don't think that's an effective way to solve the real problems that we see today.

"So my stance here is that we really need open source solutions that are available to all developers and all companies and enterprises to identify and eliminate a lot of these real safety issues."

Learn more:

Securing the Black Box: OpenAI, Anthropic, and GDM Discuss

Security Founders Talk Shop About Generative AI

California's Senate Bill 1047: What You Need to Know

Follow everybody on X:

Ian Webster

Anjney Midha

Check out everything a16z is doing with artificial intelligence here, including articles, projects, and more podcasts.

Transcript

Where we're headed is AIs are going to be a ubiquitous tool, just like a database or something like that. And there are so many dumb decisions you can make with a database. There will continue to be dumb decisions that you can make with how you interact with and give a model access. And there's no way to put a lid on that unless you completely ban AI, which is a different conversation.

But I think anything short of that, we need to start focusing on what are the practical safeguards that we put in place. Hi there, you're listening to the a16z AI podcast, and I'm Derek Harris. Last week we discussed AI and security through the lens of using AI to bolster traditional cybersecurity concerns. But this week we're discussing how to secure and otherwise put some guardrails around AI models themselves.

The discussion features a16z General Partner Anjney Midha and Promptfoo creator Ian Webster, who talked through what it means to give red-teaming capabilities to anybody building products atop LLMs. It's a problem they had to solve during their time together as early language model adopters at Discord several years ago, and one Ian is now committed to solving with Promptfoo.

Contrary to what many believe, however, Ian explains that the problems with LLMs giving, say, unsavory responses, or perhaps having access to systems and data they shouldn't, exist at the application layer more so than at the model layer. It's for that reason he believes attempts to regulate AI at the model layer are misguided. There may be only a handful of large AI labs and companies with the resources to fully red team their models, and even then they can only do so much.

But if everyone building on those models can also tune responses, access, and other factors to their needs, we'll all be much better off in terms of AI safety and security. It's a really interesting discussion that kicks off now.

As a reminder, please note that the content here is for informational purposes only, should not be taken as legal, business, tax, or investment advice, or be used to evaluate any investment or security, and is not directed at any investors or potential investors in any a16z fund. For more details, please see a16z.com/disclosures. It's been a long road to get here. You and I have been working in Gen AI since our Discord days. At Discord, I think that we were pretty early to it.

We were experimenting with Gen AI before ChatGPT was a thing, when the APIs were still in beta and that kind of thing. I think the problem that has always interested me the most is the application layer. We have this great technology. We have the infrastructure to run it. What does it look like and how do you package it when the rubber meets the road, when users actually get their hands on it?

My first experience with that was at Discord. We were experimenting with Gen AI, lots of experimentation, lots of hacks, that kind of thing. My first taste of AI at scale was when I started to lead the Clyde AI project. Clyde AI was one of the first AI chatbots out there, especially at scale. We're talking Discord scale, which at the time was hundreds of millions of users. It was also one of the very first, if not the first, agents.

We were messing around with agent workflows before agent was even a term that was coined and frequently used. We were doing chain-of-thought stuff with early versions of GPT. That's what got me interested in how does Gen AI work at scale? What are the different failure modes? What are all the things that can go wrong? The thing that I always told people is that if you had a one-in-a-million chance of something going wrong at Discord, it would happen hundreds of times.

There's very little margin for error at that scale. That's what got me interested in AI safety, AI security, and what are the guardrails and frameworks and practices that we can develop in order to start to reason about these problems in a very real-world and practical way. Let's rewind a little bit. Let's go back in history. It's peak pandemic. Discord is undergoing a massive transition from being largely a chat app for gamers to being used by groups and communities for all kinds of things.

There were open source projects that were using Discord to chat and coordinate software development. There were university groups using it. Almost every single subreddit you could think of was starting its own Discord. At that moment in time, it exploded from about 70 million monthly active users to almost double that in about 6 to 10 months.

Looking back now, it's clear that Discord became this petri dish for all kinds of new generative AI models, for people to interact with models for the first time, whether it was text-to-image models like Midjourney or early text-to-audio models like ElevenLabs. A lot of these models got their start as products that users could use on Discord.

But if you just kind of roll back to 2020 when you joined, before the platform was actually so successful, what do you think were the initial conditions that made the Discord platform such an attractive petri dish for generative models? The key thing there was, first of all, a willingness to invest in a developer platform and kind of a vision for extensibility and customization.

Discord was different from other chat platforms because it was always about a place that you can make your own and kind of a way that you can bring your own community together. And I think that culture or that vibe also informed the strategy on the developer side as well. We wanted to give developers tools and the power to kind of shape the experiences that they wanted.

I think the other thing is that in 2020 Discord was pretty much just text. So a lot of the bot frameworks that existed at the time were just reverse-engineered client APIs, because there wasn't a team that was supporting bots. So some of the first things that we did were we built out slash commands, we built out just really basic UI Lego building blocks like buttons and drop-down menus and that kind of thing.

And it sounds like very simple stuff, but as you know, Midjourney and all these other AI applications are built on top of those buttons and different bits and pieces. So I think the bet was really on the emergent nature of the ecosystem that we were building. And it was less like trying to sit back and say, okay, how do we create these perfect conditions for XYZ app?

And more like other people know better than us. So let's just give them the platform and pretty much unfettered access to the API and see what they do with it. And gaming is often a leading indicator of what happens to the rest of technology because gaming is where a bunch of leading edge infrastructure and technology gets packaged up for the first time.

People who play games, especially real-time multiplayer games, are often early adopters of new technology. And so one of the things that always strikes me when you and I meet up and we talk about the most pressing problems in AI risk and security and safety today is that you had the closest thing to a secret sort of crystal ball years before anybody else, because Discord happened to be this platform that had a ton of early adopters.

And it was a team that was quite close to a couple of the other research labs. And so Discord was one of the first early access partners for a bunch of leading models from OpenAI. And as the person who was responsible for productizing those models and exposing them in a way that was usable as an application by everyday people,

you were essentially living a couple of years in the future. A lot of the challenges that application developers today are facing when they try to expose AI models to the real world, you had to tackle two and a half years ago, before most of the world even knew what these challenges were. And so I always come away from our conversations realizing that you've actually had the answers all along; it's just taking everybody else a while to catch up.

So let's roll back in time to 2022. GPT-3.5 had been out for a few months, and we began experimenting with ways to expose that to users on the platform who wanted to talk to an AI companion in an engaging way right inside of Discord, which is a communications app that people spent millions of hours in every week. What were the first few challenges that you noticed when trying to productize an application around those models?

Yeah, the first challenge, especially with early versions of GPT-3 and GPT-3.5, was just getting the thing to do what you wanted it to do, particularly in the context of Discord. So, you know, these models were trained for the OpenAI product context, and they had trouble understanding, okay, now I'm in a Discord server, I need to act a certain way, my capabilities are constrained in a certain way, that kind of thing.

So the first challenge was just basic work to get it behaving the way that we wanted it to in the product. And I realize that this is not an interesting first challenge, yet it's the first challenge that everyone faces with an AI model. So I think we went through the usual phases of development, which is we started with vibes, we put something out there.

We were very quick to try to get a V0 out and start getting feedback. And then our process began to mature. So the first step here was that we started to run evals. And a simple example here is, at the time those models were very quick to revert to "As an AI chatbot, let me help you. Let me be of assistance." And we didn't want that because that's not what people come to Discord for. People came to Discord and were using this AI to have fun, to socialize, that kind of thing.

And people did not associate Discord with a place where they would go to do their homework or other common ChatGPT use cases. So we began these evals in order to try to give Clyde a bit more of a personality and make those interactions more interesting and more appropriate for the social space that they were in. And the first hard lesson here was that this is much more easily said than done, especially with a model back then.

So I mean, this is silly, but I remember running lots of evals to try to get it to type in lowercase and very casually. Just these little things make a really big difference to the overall user experience. Right. And it also affects the actual content of what it's generating, because it's just generating the subsequent tokens.

Eventually our goal shifted toward how do we create an AI that feels like just a regular user on Discord. Right. And that was actually not necessarily the obvious first path. We were scratching our heads trying to think whether this AI companion is one of us, like one of the users, or is it some other AI chatbot helper from the future that can help people interact with Discord.

But there were many ways that we could go. I remember when we were trying to design the right abstraction for the model to be exposed to users, whether it was going to be hosted as another person or as a sort of system-level helper, there was no kind of canonical set of evals or evaluations we could run to figure out whether it was good at one thing over the other. And you mentioned something, which was that the canonical way at the time was basically the vibe check.

Can you paint a little bit of a picture of what that testing process looked like at the time? Yeah. The vibe check process is great for getting to like 50%. But I would have been an old man if I kept vibe checking all the way to production or all the way to GA.

So our early eval process was just crowdsourcing within Discord a bunch of the most common checks or pitfalls that people were running into, and putting them into something that could just run, spit out all those answers automatically, and give us a way to scan the outputs. And I have a lot of personal projects and side projects and that kind of thing. So I was tinkering with LLMs on the side too, and I had the same problem that I saw with Clyde.

So I started building this open source tool to help me with that. And what was the most difficult part of shipping a model like GPT-3.5 in a product like Clyde? The problem, or the hard part, about Clyde was that it was general purpose, right? So if you have something that's general purpose, it's very hard to understand how to properly constrain it. So we saw people using it for great stuff, and we saw people using it for things that, you know, we didn't really want to happen on Discord.

The hard part was figuring out how to adjust those knobs in order to make sure everything was in a safe place for the platform. Yeah. And the only way to do that is if you have a way to repeatedly measure that risk. Otherwise you're just going to be blindly grabbing in the dark, and it becomes an ugly trial-and-error process.

And that was the seed of what eventually became Promptfoo, right, which is a full-blown tool that does evals and also does other things, particularly in the safety and security space. What is Promptfoo? So Promptfoo is a tool for developers to help them build apps that are reliable and secure, specifically for AI and LLM use cases.

And the way it works is by systematically running a bunch of inputs through your LLM app, measuring the behavior, and then guiding you on how to improve that. So this is pretty useful just generally if you're trying to increase the quality of your LLM app. It's also especially useful in a security context, because we can generate malicious inputs and see how the application responds to those unexpected situations.
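To make that loop concrete, here is a minimal conceptual sketch in Python, under the assumption that your app is wrapped in a single function that takes a user message and returns the model's reply. It is illustrative only, not Promptfoo's actual API; the case labels and checks are made up for the example.

# Conceptual sketch of the eval loop described above -- not Promptfoo's actual API.
# `call_llm_app` is a hypothetical stand-in for whatever wraps your prompt + model call.

from dataclasses import dataclass
from typing import Callable, List

@dataclass
class EvalCase:
    input: str                      # the user message to send
    check: Callable[[str], bool]    # assertion on the model's output
    label: str                      # human-readable description of the check

def run_evals(call_llm_app: Callable[[str], str], cases: List[EvalCase]) -> float:
    """Run every case through the app, print failures, and return the pass rate."""
    passed = 0
    for case in cases:
        output = call_llm_app(case.input)
        if case.check(output):
            passed += 1
        else:
            print(f"FAIL [{case.label}]: {output[:120]!r}")
    return passed / len(cases)

# Example checks in the spirit of the Clyde work described earlier:
# casual tone, no "As an AI" assistant boilerplate.
cases = [
    EvalCase("hey what's up", lambda out: "as an ai" not in out.lower(), "no assistant boilerplate"),
    EvalCase("tell me a joke", lambda out: out == out.lower(), "types in lowercase"),
]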

That's eventually where we got to with Discord: the evolution here is you go from vibes to evals, which are focused on just making the product better on the happy path, and then to adversarial evals, or red teaming, which is making sure that your application can handle situations in which it's being attacked. We went through that process ourselves over a few months as we very quickly ramped up this product to millions of users.

Now, with Promptfoo, I work with hundreds of users and companies that are going through this, and I see the same kind of thing playing out, where people start with vibes, they go to evals, and then they go to red teaming. What inspired you to create Promptfoo? Can you just walk us through the journey from idea to implementation? You could have gone and maybe used existing tools, but what led to the creation of a net-new tool?

There are a couple of things that I think are really important in an eval tool. The first is that I think evals are a commodity and should be open source, and evals are like the basic unit of improvement for AI-based applications.

And to me that should just be a regular part of the developer toolkit. There are many great eval products out there, but to me having a separate cloud-based product that's commercial didn't make sense, because to me it would be akin to charging money to run Jest or pytest or any other basic unit testing framework.

That's a long way of saying I think evals should just be very simple and free and open source. The other thing that I really think is important is evals should just be local, and they should just be a tool that, as a developer, you can start using. So at Discord, for example, I didn't want to have to talk to a bunch of people and get approval and go through finance and procurement and everything just to run a couple of tests. So I really think that's what led me to an open source solution.

Sounds like you began by designing the tool for yourself and then realized that lots of other people had the same problem. When did that happen? Yeah, so like I said, originally I put it out there for my own projects. My philosophy has always been I develop open source by default. I have a lot of open source projects. So I put it out there, and at the time there was a gap. There was room for something that was both open source and local that a developer could just pick up and use.

So people started using it, then I started using it at Discord, because I was like, well, you know, I have this thing, I have the same problem here. And yeah, it grew pretty well organically, just because a lot of people for the first time were encountering these alignment or constraint issues with GPT or with whatever they were using. So over the summer things really picked up with Promptfoo.

Why don't we talk about the area of AI red teaming? Can you just break down the concept of automated red teaming for somebody who may not have a background in security? How does that work in practice? The way that I explain this to people who don't have a background in AI or security is that using an LLM is kind of like hiring a person.

When you hire a person, you look for skills that they have that are specific to your task. Let's say you're hiring an employee: you want to make sure that they are capable of responding to unexpected situations. So that's what evaluations are for, making sure that your AI has competencies within your specific use case.

So if you're hiring someone to write software, you probably don't care what their SAT score is, or you don't care if they are really good at cooking or something like that; you just care about the specific skill set. So that's what the Promptfoo evaluation framework does: it makes it so that you can set up these tests. And red teaming takes it one step further.

The way that our red team works is we generate a bunch of malicious inputs or malicious instructions, and we use an unaligned model to kind of seed these malicious inputs, and then we essentially run a search. So with those seeds we try to figure out how we can manipulate these inputs to trick the AI, or the employee, into doing things that are undesirable.
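As an editorial aside, here is a rough sketch of what that kind of seeded attack search can look like in code. It is a conceptual illustration, not Promptfoo's actual implementation: generate_seeds, mutate, and is_unsafe are hypothetical stand-ins for the unaligned seed model, an attack-rewriting step, and an output grader.

# Conceptual sketch of a seeded red-team search -- not Promptfoo's actual implementation.

from typing import Callable, List, Tuple

def red_team_search(
    call_llm_app: Callable[[str], str],          # the target application under test
    generate_seeds: Callable[[int], List[str]],  # unaligned model producing malicious seed prompts
    mutate: Callable[[str], str],                # rewrites an attack to try a new angle
    is_unsafe: Callable[[str], bool],            # grades whether the output is undesirable
    n_seeds: int = 20,
    iterations: int = 5,
) -> List[Tuple[str, str]]:
    """Search over attack variants; return (attack, output) pairs that succeeded."""
    findings = []
    frontier = generate_seeds(n_seeds)
    for _ in range(iterations):
        next_frontier = []
        for attack in frontier:
            output = call_llm_app(attack)
            if is_unsafe(output):
                findings.append((attack, output))    # record a successful attack
            else:
                next_frontier.append(mutate(attack))  # keep searching from this point
        frontier = next_frontier or generate_seeds(n_seeds)
    return findings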

I can go into more detail there, but that's kind of the high level: we start with these malicious seeds, and then we run a search over the state space of your LLM and attempt to kind of push its limits and break it in that way. And based on all the experiences you've seen with Promptfoo developers, what are the most common vulnerabilities that you're seeing in AI apps?

This will depend largely on what you're building, but in agents, for example, the most common pitfall I see is tool use and tool availability that doesn't have proper access control. It sounds very obvious when I say it out loud, but it's a really common footgun that I've seen time and again, across public companies, private unicorns; it's very easy to make these access control mistakes.

So that's giving an agent access to tools or a database where the user can kind of manipulate it in order to get full access. For the RAG use case, a very common issue here is context poisoning. So you're bringing in context from a database or even the internet, and you're treating it as trusted in many cases, but there can be malicious inputs or just other things in it.

So most RAG setups are not resilient to that out of the box. And then just very generally, I think even if you're not doing an agent or RAG, narrowing the capabilities or behaviors is a really common problem. I saw that in the wild the other day: you can get Amazon's review bot to do your homework or to help you write your code, even though it's supposed to just be a Q&A bot for reviews. So yeah, that's also fairly common.

I see. And you mentioned when we were talking earlier that over 30% of red team scans surfaced some kind of critical vulnerability in the wild. Can you just elaborate on what constitutes a critical vulnerability? So what we classify as a critical vulnerability is agent tool use that leads to privilege escalation, as well as just straight-up harmful content generation.

So things like child exploitation or sexual violence, that kind of thing; there's a surprising number of applications that can be manipulated into producing those outputs. Right. And how does Promptfoo help address those? The number one thing that Promptfoo does is expose these vulnerabilities to begin with.

So issues like access control in agents, that's something that you would probably want to fix statically, in code, so that's just a missing piece. For things like harmful generations, a lot of this has to do with how you set up the application and how you instruct the model to narrow its capabilities down to the specific application use case.

So there are a bunch of ways to mitigate this. Often it's through prompting; you probably can't get to 100%, but you can get most of the way there. There are guardrails that you can use in production, and there are also other filtering mechanisms that you can use in production. And the way that Promptfoo works is that because it's an evaluation tool and we have access to your full system, your prompts, your chains, that kind of thing,

we can actually try out different changes and answer the what-ifs: what if we did this differently, how much would this improve performance or reduce risk? One of the things that prompted this discussion was you penned a pretty thoughtful and thought-provoking post about how the future of AI safety and security is open source. What did you mean by that?

The reason why I think that the future of AI safety is open source is that I think there's been a lot of high-level discussion about what AI safety is, and some of the existential threats, and all of these scenarios. But what I'm really hoping to do is focus the conversation on the here and now: what are the harms and the safety and security issues that we see in the wild right now with AI?

And the reality is that there's a very large set of practical security considerations that we should be thinking about. And the reason why I think that open source is really important here is because you have the large AI labs which have the resources to employ like specialized red teams and start to find these problems.

But there are only, let's say, five big AI labs that are doing this, and the rest of us are left in the dark. So I think that it's not acceptable to just have safety in the domain of the foundation model labs, because I don't think that's an effective way to solve the real problems that we see today. So my stance here is that we really need open source solutions that are available to all developers and all companies and enterprises to identify and eliminate a lot of these real safety issues.

And what do you think the right way, then, is to develop AI in a way that ends up mitigating the most pressing risks? The most likely scenarios for AI incidents actually occur at the application layer. So if you think about the point of integration between models and the application, that's actually where the key design decisions that shape the overall security of the system get made.

These are things like the examples I gave you earlier, like if an agent has access to a database or access to customer information and is not properly configured to control that access. I think that scenario is a thousand times more likely than some of these other hypothetical scenarios purely at the model layer; probably more than a thousand times, substantially more likely.

And in fact, I think it's extremely likely that the first large-scale AI incident that affects many people negatively is going to happen at the application layer as a result of something like that. So that's where I want to focus the security conversation. So you've been quite vocal about your opposition to SB 1047, which is an AI bill in California that attempts to make AI model development safer. What do you think the bill is getting wrong?

I can speak only from my area of expertise, right? So I have taken AI apps to millions of users. I've seen the real problems as we scale.

And then I've also worked with hundreds of companies to understand what problems they're facing. And that experience has led me to believe that not only is this bill mistargeted at the model layer, I actually think it is very dangerous, because what it's doing is focusing the safety and security conversation on a set of hypotheticals that don't reflect the security and safety challenges that we actually see in the wild today.

And so that's what I think it gets wrong. I think that responsible and safe AI depends on practitioners at the application layer. And right now, when I talk with CISOs or AI builders, a lot of them are in the dark and kind of scratching their heads as to what are the best practices and what are the steps that they can take to secure their apps.

So this conversation about foundation models and speculation, I think, is damaging to the people who are actually building. I think it misleads them and gets them focusing on the wrong things.

There's prior art here; we do have penalties for people who are brazenly insecure with their systems or the way that they handle customer data and customer information. That's consistently what we see: people are not careful with the access that they give to agents, they're not careful with what they put in the context.

That kind of thing. And this is just not stuff that you can really test or prevent at the foundation layer. So that's why I think 1047 is mistargeted, because the things that are actually being exploited right now are just not at the foundation level. I think that there are examples of jailbreaks and prompt injections and that kind of thing, but those are just vectors. And there are wide-open doors through other application vectors as well.

There are risks today, right. We have models today that are semi-capable but still able to be used by bad actors, and we actually see that in the wild now. There's a lot of phishing. There are deepfakes. There's probably manipulation of our elections and social media and that kind of thing. Right.

So I think all that is happening today. And if you were to map out how to prevent that, it's not by capping the models. It would be that you could take steps at the infra layer or the application layer in order to prevent these sorts of use cases. And I think that that is a much more realistic and pragmatic approach to solving these sorts of harms.

The common misconception thrown around by safety advocates is that large models are the most unsafe models, and that somehow restricting the ability for researchers to open source large models is going to make this space safer. What do you think that argument gets wrong?

So empirically, that's not my experience. At Discord, we solved a lot of our safety problems by upgrading the model, right, moving to a much larger model; GPT-3.5 to 4 was a step change, and we see the same across open source models as well.

So I think that's one counterargument there. One of the other things that I notice just by looking at these red teams at scale is that larger models generally do better in terms of safety, but it's highly dependent on, again, the context that everything is packaged in. And there's not necessarily a straight line between model size, or what's commonly thought of as the intelligence of the model, versus how easily it can be misappropriated.

So the relationship is very largely influenced by the way that it's set up in the application, the data that it has access to, the way that it's prompted, the guardrails that are in place. All of those things are, I would say, more important. If people were focusing on mitigating app-level risk as opposed to all this debate about the model level, yeah, what might be some bad things that would have been prevented?

So examples here that I think matter more to businesses and companies, and are less in the realm of existential risk: there was a news story that went around of a car dealership that put out a chatbot, and someone convinced it to sell them a car for a dollar or something like that. Something similar happened with an airline in Canada, I believe, where they had a chatbot and a user was able to negotiate a legally binding refund.

It was not like existential risk for humanity, but from the point of view of a practitioner or an enterprise that is trying to get these products out there, those are actually deal breakers to have models that behave in unpredictable ways like that. Right. We saw some of this at Discord as well. One very public example was the invention of the grandma jailbreak. I don't know if you remember the grandma jailbreak.

There was a user that tricked the AI into thinking that it was telling a story that reminded them of their dear old nanny. And that got it into a state where it would be okay doing things like telling them how to build a bomb, or producing other harmful outputs like that. That was before we started red teaming or doing evals at Discord.

And that's definitely one of the things that raised some questions about how we were approaching safety and security. And that was the catalyst for a lot of those changes as well. And is your intuition that having robust enough red teaming would prevent jailbreaking of models, from what I understand of the current model architectures?

You'll never drive the incidence of jailbreaks down to zero, especially if you look at these jailbreaks today; the best jailbreaks are worded in such a way that they could be legitimate requests. And if you were asking legitimately, like if the context was different, you would be pissed off if the model was like, sorry, I can't help you with that. My contention here is that there's always going to be a gray area where it's not totally clear what the model should do.

And I think the idea that we can or should eliminate all jailbreaks is probably flawed, because it ignores that middle ground. I think the name of the game here is, as a developer or a company that's putting out AI, how do you move that line and measure and adjust the risk that you would tolerate? So at Discord, we were willing to tolerate a bit of risk in order to make sure that our users had a great experience.

If I were a bank or an airline, I would probably adjust that line way, way down and ship something accordingly. And I think the missing link today is that no one has a systematic way to do that. And that's why red teaming is necessary.
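As a rough illustration of "moving that line," here is a small Python sketch that turns a red team run into a measured attack success rate and gates a release against a tolerance chosen per application. The data shape and the tolerance numbers are illustrative assumptions, not output or thresholds from any real tool.

# Minimal sketch of a risk gate on top of red team results -- illustrative only.

from dataclasses import dataclass
from typing import List

@dataclass
class RedTeamResult:
    attack: str
    succeeded: bool   # did this probe produce an output we consider unacceptable?

def attack_success_rate(results: List[RedTeamResult]) -> float:
    """Fraction of probes that got the app to misbehave."""
    return sum(r.succeeded for r in results) / len(results)

def go_no_go(results: List[RedTeamResult], tolerance: float) -> bool:
    """Ship only if the measured risk is within the tolerance set for this use case."""
    rate = attack_success_rate(results)
    print(f"attack success rate: {rate:.2%} (tolerance: {tolerance:.2%})")
    return rate <= tolerance

# A social app might accept more residual risk than a bank or an airline would.
SOCIAL_APP_TOLERANCE = 0.05
BANK_TOLERANCE = 0.001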

If we just kind of look at how the evolution of model development and deployment has played out: in early 2022, when you were working on Clyde at Discord, you came to the conclusion that a lot of risk existed in the unpredictability of LLMs.

And so you built Promptfoo as a kind of adversarial red-teaming tool. And then if you fast forward to today, at this point over 24,000 developers are using the same tool you built, all the way from labs like OpenAI and Anthropic to application developers like Shopify. And then across all those developers, what are the most urgent needs that they're asking for? What are the top two or three things that come to mind?

By far the most urgent need is the ability to quantify and reason about these risks. Like the hardest thing in my experience and the toughest thing for my users that I observe is how do they work with their safety, security and legal teams in order to get things over the line and in order to get folks comfortable with putting out products that contain generative AI.

And I actually believe very strongly that although LLMs have great potential, we will never realize that full potential without solving this problem, without being able to measure the risk and have someone in a decision-making capacity say, yes, we're okay with this, understand the trade-offs and make a go-no-go decision. I think that's actually why the role of Gen AI in the enterprise is kind of lagging for a lot of user-facing use cases.

The reality is that no one on the CISO side, no enterprise that is trying to launch AI, cares about existential risk. It's not on their scorecard. The things that they care about are, is my customer service agent going to give away free money? Is it going to mention my competitors? Is it going to enter me into contracts that I can't keep? It's just stuff like that. Yeah. Stuff that AI risk people will roll their eyes at. But that's actually what most of the world cares about.

Right. Right. How would the CISO describe these categories of risk? Brand and legal risk. Right. Yeah. That's the thing that we didn't really talk about, because 1047 just focuses the conversation on these hypotheticals or bogeymen. So the real thing that is stopping AI in its tracks in enterprises is brand and legal concerns, which are a lot more specific and don't make the headlines.

To play that forward, the reason those enterprises care about that risk is because the ultimate harm to the end consumer is what? It depends on the context. It could be bad consumer experiences. Brands care about that. It could be misleading customers. And there are real harms that I measure. So I mentioned the most critical things that we find are things like child exploitation and that kind of thing.

And a concrete example there is, we're helping a company that is like a travel agent. So people can go to their chatbot and ask about sex tourism in Thailand. That's not something that they would want to help out with, obviously. So there are very human impacts and risks kind of wrapped up in this brand and legal risk package. Right.

And do you think that the CISO is thinking about using Promptfoo as a way to prevent the model from doing things, or do you think the right mindset is, look, we're never going to be able to prevent it?

We're never going to be able to take the situations where it violates our terms of service or policy, or does something we don't want it to, or says something it shouldn't, down to zero, but it's about minimizing that. And then when it happens, Promptfoo is more of an insurance policy for me, because I've done everything in my power to, as you said, kind of stretch it to its limits. If I'm a CISO, how should I be thinking about what I'm buying?

Security people are very used to thinking about risk on a sliding scale. Right. They're never going to drive their AI risk down to zero. Right. They're never going to drive any of their security risk down to zero. Right. That's just the life that they chose. Right. Right. So when I kind of hammer home, like look, this is all about reducing risk and measuring risk. Yeah. I'm giving them a toolkit that they can use to say, like, okay, yesterday we were here. Now we're up here.

This is how a CISO wants to measure progress in their organization. They want to drive down risk. They want to improve the performance on these red teams. Yeah, it's a fun story I think we've talked about before, which is that the night before we were going to ship Clyde, I got an anxious email message from the legal team saying, I just tried asking Clyde to say something pretty rude.

And look, here's a screenshot of it doing that. What are you guys going to do to prevent it from saying stuff like that? We had to delay the ship by a couple of days because, you know, at the time, the most robust check, as you mentioned, was a vibe check.

But do you think the ultimate answer is that companies and enterprises, developers of applications, just have to reset expectations: that one of the costs of providing end users access to these incredible models that can do all kinds of things, like help you plan your day and help you do your homework and help you write emails and so on, is that once in a while they will say things that are unpredictable?

And as long as everybody expects and acknowledges that that's just a risk of this new technology, it's a low risk, but it's non-zero, then everyone's better off, because then we're just going to deploy these models faster, which will allow innovation to go faster, which will allow consumers ultimately to get benefits faster. Because typically, when you allow legal and compliance to drive product roadmaps, the person who ultimately ends up losing out is the end consumer.

I think there's truth to that, but my view on that is actually more optimistic. I think that we can drive down risk and really minimize it to a point where hopefully our legal and security stakeholders are happy. And I think the way that you do this is by having a great tool to measure that risk and actually move the needle on it.

So when the legal team called you before the Clyde launch, yeah, you're right, we probably fixed that with just vibes. But I think it would have been much more powerful to say, look, we have a thousand probes that we ran, we measured the risk at like 5%, and we just pushed it down to 0.01%. And we can give you some examples of that. And we have filtering and detection in place so that we can measure and really introspect that 0.01%.

I think that that's a much better, a much more reassuring answer. Right. The other side of the coin is like when something goes wrong, we also want to be able to detect it and filter it or at least capture it for analysis or testing later on. That's intriguing, right? Because you're saying, look, you can model the risk here. You can quantify it. You can assess it. You can measure it.

And then you can let the end application developer actually decide where on the risk distribution curve they want to be. And they can say, well, we're comfortable taking the position that, you know, 10% of all user queries, for example, might fall into a zone that we're not comfortable with.

Or they could say, actually, our tolerance is just 5% for our business, and we're not going to ship this model; we're going to actually keep improving it until we get its current risk assessment down. And at that point, what you allow is application developers to assess their individual needs and risks for their business in whatever vertical they're in. And so it's very possible then that a developer building an AI app for healthcare will have a dramatically different risk tolerance from an AI developer in entertainment.

And one of the problems and challenges of model-level regulation like SB 1047, which is not at the application level, is that it takes a blunt hammer and says all of you must adhere to this kind of blanket threshold. And so if we do have the tools to measure risk, why aren't people taking the application-level risk approach and saying, well, there are actually different risk tolerances? Maybe healthcare applications should have a lower risk tolerance.

And apps that have access to your financial life and your bank account should have a lower hallucination rate, right; they should pass the adversarial tests with flying colors relative to a meme generator. I think that's a great point. My best guess is that red teaming has not really been part of the conversation in Sacramento or some of these other places that are considering regulation.

I think it's very reasonable for builders to set the risk threshold that is appropriate for their application. That's exactly what we did at Discord, by the way: we figured out what percentage rate we would tolerate for outputs that we thought were not appropriate. And we would measure those and flag those, and if necessary, we would follow up on them. So to me, I think that's a very reasonable approach.

You know, it allows for forward innovation. And it's very quantitative. You can understand what you're getting into. And I do think that it could very reasonably vary from industry to industry. I think the biggest problem right now is that people don't have the tools to do that. Right. That's like an awareness problem. Yeah.

What's your view on the fact that the bill tries to put civil and criminal liabilities on open source model developers for harmful things that downstream developers do with an open source model? How does that affect AI safety and security? I think that's extremely damaging to the open source community. Just taking a step back from AI: so much of the software world that we live in, and just everything that we take for granted today, is built on open source and built on transparency and goodwill.

So this is a very different approach that I think will have a significant chilling effect. And it will also push a lot of this development and red teaming into the hands of a very few. And I think it's actually a lot harder to hold a small number of people with immense power accountable, versus an approach that is a bit less one-size-fits-all. Well, let's hope sanity will prevail. We'll find out. And what would you say to somebody who says, yeah, that makes sense, you know, today

these models are not reliable enough, but over time, if they just get smarter, at some point they'll just be able to reason about rules that you tell them to follow, and at that point these jailbreaks and so on will just be obviated. How do you respond to that? A couple of thoughts on that. I think that there will always be fuzzy areas at the margin that will be necessary for developers and decision makers to understand.

I also think that we've seen that GPT-4 was a big jump over 3.5, but the model's ability to reason at a higher level actually introduced new jailbreaks and new vulnerabilities that were not present in simpler models. So I don't think there is necessarily a straight line there. And the last thing that I will say is, even as models get better, there will be a trend toward putting more sensitive information in the path of the AIs

being deployed. So I actually think the number of security incidents, and certainly the amount of security risk, is going to increase even as these AIs get better. I don't believe that these problems will just magically go away as the models get better, because there's a class of problems that exists at the foundation layer. And I'm sure OpenAI and Anthropic and all those guys will minimize those problems or make them almost go away.

But there's a whole other class of problems that exist only at the application layer, which those guys will never be able to touch. The reason for that is that on the application side, where we're headed is AIs are going to be a ubiquitous tool, just like a database or something like that. And there are so many dumb decisions you can make with a database. There will continue to be dumb decisions that you can make with how you interact with and give a model access.

And there's just no way to put a lid on that unless you completely ban AI, which, you know, is a different conversation. But I think anything short of that, we need to start focusing on what are the practical safeguards that we put in place. So one counterargument against the red teaming approach would be that if models continue becoming so competent that they're actually just able to hide their true intentions during red teaming,

and therefore will sort of pass the adversarial tests, then how could we prevent the kind of harmful risks of models? One take here, going back to my thinking about open source security: I think it's probably important that we address existential risk out in the open, through open source, through open source research, open source tools.

That kind of thing. Otherwise, I think it winds up being this bogeyman behind closed doors, and something that is only in the domain of several AI labs, and not really something that we can understand or hold folks accountable for. For the folks who are in the security space and working on deploying these models, it can often seem quite discouraging that the attack vectors haven't changed but the speed and scale of attacks like spearphishing and deepfakes and so on are increasing.

Because LLMs can be used by the bad guys, are you optimistic about a future where we could actually address those increasing risks? I'm definitely optimistic about the future. I think that the countermeasures, such as red teaming, are starting to proliferate from the narrow domain of a couple of elite labs and giant companies into tools that everyday developers and organizations can use.

So what gets me excited is that our red team capabilities are improving, and there's tons of great research and new development in techniques to improve red teaming and improve the search for these attack vectors, which is becoming more and more effective every day. Our duty here is to go out and make sure that people who are building with LLMs and interested in security and safety understand that they have these tools available to them now.

And there we are, another episode in the books. If you made it this far, thanks for listening and please do remember to rate the podcast and share it with your friends and colleagues.

This transcript was generated by Metacast using AI and may contain inaccuracies.