Agent Pentest Benchmarking | Episode 52

May 14, 2026 · 18 min · Ep. 52

Episode description

In this episode of BHIS Presents: AI Security Ops, the team breaks down a new benchmarking framework designed to evaluate AI pentesting agents against real-world offensive security scenarios.

What began as experimental evaluation of “can AI hack?” has quickly shifted into something much closer to operational reality. Organizations are now seeing a surge in agentic tooling and automated pentesting workflows, where human-guided AI systems consistently outperform fully autonomous agents in complex, unsupervised environments.

As AI tooling evolves, teams must balance speed with validation, monitoring, and oversight as offensive capabilities outpace defenses.

We dig into:

  • The new “AutoPenBench” framework for benchmarking AI pentesting agents
  • Why fully autonomous AI hacking only achieved a 21% success rate
  • How human-assisted AI workflows increased success rates to 64%
  • Testing AI agents against Log4Shell, Heartbleed, Spring4Shell, and classic web exploits
  • Why modern offensive AI systems still require heavy human oversight and validation
  • How custom internal AI frameworks are already finding vulnerabilities humans missed
  • The operational role of prompt engineering, scaffolding, and agent memory
  • Real examples of AI agents mis-scoping infrastructure and chasing irrelevant targets
  • How AI lowers the barrier for ransomware operations and offensive capability development
  • Why defensive teams need stronger edge visibility, packet capture, and AI-aware monitoring strategies

📚 Key Concepts & Topics

AI Pentesting & Agentic Security

  • Autonomous AI hacking agents
  • Agentic AI workflows
  • AI-assisted penetration testing
  • Offensive security automation


Benchmarking & Evaluation

  • AutoPenBench
  • AI security benchmarking
  • Human-in-the-loop validation
  • Long-horizon task evaluation


Offensive Security Operations

  • SQL injection
  • Path traversal
  • Log4Shell / Heartbleed / Spring4Shell
  • Kali Linux offensive tooling


AI Infrastructure & Model Operations

  • Prompt engineering
  • Persistent agent memory
  • Roleplay jailbreak techniques
  • Guardrail reduction strategies


Defensive Security Strategy

  • Defense in depth
  • Edge network monitoring
  • Zeek network analysis
  • Packet capture visibility


Industry & Threat Implications

  • AI-enabled ransomware operations
  • AI-assisted red teaming
  • Infrastructure scoping failures
  • Operational scalability challenges

#AISecurity #CyberSecurity #Pentesting #AIAgents #RedTeam #EthicalHacking #CyberDefense
----------------------------------------------------------------------------------------------

  • (00:00) - Video Intro and Sponsor
  • (01:20) - AI Pentesting Benchmark Overview
  • (02:11) - How AutoPenBench Works
  • (03:44) - Real World Results and Experience
  • (05:16) - Real World Results and Experience
  • (06:48) - Human and AI Collaboration
  • (07:38) - Improving AI Agent Workflows
  • (08:56) - Model Limitations and Updates
  • (10:35) - Jailbreaks and Model Guardrails
  • (13:16) - Provider Controls and Trust Factors
  • (14:41) - Lower Barrier for Cyber Attacks
  • (15:39) - Defensive Security Implications
  • (16:59) - Why Red Teams Need AI Now


Creators & Guests

Brought to you by:

Black Hills Information Security 

https://www.blackhillsinfosec.com


Antisyphon Training

https://www.antisyphontraining.com/


Active Countermeasures

https://www.activecountermeasures.com


Wild West Hackin' Fest

https://wildwesthackinfest.com

🔗 Register for FREE Infosec Webcasts, Anti-casts & Summits
https://poweredbybhis.com


Transcript

Video Intro and Sponsor

Brian Fehrman

Hey, everyone, and welcome to this week's episode of AI Security Ops, where we are going to talk about a new benchmarking framework that was put out for AI pentesting agents. But before we hop into that, as always, let's talk to you about Black Hills Information Security, one of our proud sponsors, and on Derek's shirt right there, the sick logo, and that is retro. Love it. Old school. If you or your company are in need of any kind of security testing, external, internal, AD reviews, web apps, physical pen testing, wireless, social engineering, literally anything you could think of security related or security operation center monitoring type services, Black Hills Information Security offers all that and more.

Check us out today at blackhillsinfosec.com. Additionally, if you are interested in training for you or your organization, we also have a training branch, Antisyphon Training, where good folks at Black Hills who do these things day in and day out package up all their knowledge in a nice format for you to consume, to help you move along in your career and understand things a little bit better, at a very affordable price. So check out antisyphontraining.com. So, to hop into this: everyone's been asking, can AI replace pen testers?

AI Pentesting Benchmark Overview

And this paper actually tried to measure it, and the results were pretty interesting. The research was a benchmark that they put out for testing different agentic AI pen testing components, we'll say, and they call it AutoPenBench. And who is this paper actually put out by? Let's take a look here. So I don't have the author in front of me.

I'm sure it's multiple authors. Looks like people from a couple different groups, PoliTo, UniTo, and NEC Labs, out of Italy and Germany, looks like. Couple of researchers out of that area. But what do we got? So they basically put together a set of hacking challenges for AI agents to attempt to complete, and they had two different flavors of these challenges: what they called in vitro, or textbook scenarios of SQL injection, path traversal, weak passwords, and then also more of the real-world CVEs, things like Log4Shell, Heartbleed, Spring4Shell, and a few others.

How AutoPenBench Works

And what it is is the AI gets a Kali Linux machine to utilize and has to find and exploit the target without any hints. So that's pretty interesting. Derek, have you ever tried pointing AI at a CTF or lab environment to see what happens?

Derek Banks

Well, I have. But before I talk about that, I think it's interesting that they chose to give the AI, like, a Kali Linux machine. I mean, it seems kinda like overkill, but I don't know. I mean, I guess it's one way to do it. Right?

Computer control. But I think just that in itself kinda, I don't know, defeats the, what's the term, the Bitter Lesson, where really you should give the AI just as much context as needed to, like, complete the goal and let the model do, like, quote, the heavy lifting. And I just question whether or not, you know, if you just give it access to a system. I don't know. So then again, I mean, if I would have done my homework, I could have read the paper, and maybe I would have the answer to my question.

Real World Results and Experience

So my experience is kinda like we'll talk about here in the punch line here in a little bit, but just giving AI access to tools and some prompts to go do the hacky hack kinda leads to mediocre results. I think that there's much more context that needs to be given. Long story short, I've been working on kind of a custom internal agentic framework that's custom coded, and all the agents are custom coded. And still, it does an okay job. It's found stuff that our humans missed, but also our humans find stuff that it's missed.

And so I do think that, you know, as they say here, fully autonomous had a 21% success rate in terms of the vulnerabilities that they were testing for, where human-assisted had a 64% success rate. And that's kind of been my experience: I think that at the moment, autonomous pentesting platforms aren't, quote, going to kill security, at least at this point in time, you know, in 2026.

Brian Fehrman

Yep. No. I don't think so either. I think we still have that kind of symbiosis, might be the right word there, between the AI and the human, where they're augmenting one another throughout these processes.

Real World Results and Experience

Derek Banks

Yeah. I mean, I think there's a lot of promise for sure, especially if you try, like, what we're doing, to encapsulate our institutional knowledge, essentially, to give the folks who are starting a test, our analysts, kinda like a leg up in the investigation. Because, you know, AI is good at going through a lot of data and making sense of it, summarizing it. I mean, when you start off with an external penetration test, you typically start with a mound of data that needs to be analyzed. Right?

And then you move on to doing manual stuff. And, you know, I had an AI agent basically find a critical, and the tool it was using was curl to look at web services. So I don't think that it needs to be, you know, super complicated. Again, like, give the AI just enough tools and context and then let a human look at the results.

Brian Fehrman

Yeah, and then kinda go through. It's gonna be an iterative process too, right? I mean, it's never gonna be, you just set this thing up, and it's ready to go. I mean, it's gonna be this nice feedback loop of you give it the instructions, it finds results, you go through the results and review, and you're like, oh, hey. I noticed that there were these things that didn't get chained together, or, hey. I think it's missing this information, or maybe it should've looked here a little bit more.

And then you start updating your prompts, updating scaffolding and harnesses around it, maybe adding in additional tools or abilities for it, or maybe constraining things as well. You might find that it's starting to look at things you didn't want it to, which I found when I was testing out, you know, some of our stuff recently against Black Hills. It found things that were related to people who worked here but weren't actually Black Hills infrastructure. Oh, right.

Human and AI Collaboration

Derek Banks

It was a GitHub repo. Right? It was, like, having...

Brian Fehrman

Yeah. Yeah. When I first looked at it, I was like, what is this thing? And then I went and did a little bit more digging and then realized, oh, the URL is, like, one of our testers' names, but it's broken up into different chunks. It's like, oh, I see it now. Right. But yeah. So things like that, you just iterate through and kind of, you know, make better each time. Right?
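The scoping problem Brian describes, where an agent wanders onto assets that belong to employees rather than to the organization, is one place where a hard check outside the model can help. The following is a minimal sketch, not anything described in the episode: the `IN_SCOPE_DOMAINS` allowlist, the function names, and the `example.com` domain are all hypothetical. It assumes scope can be expressed as a set of domains, and it routes everything else to a human instead of letting the agent decide.

```python
from urllib.parse import urlparse

# Hypothetical allowlist; a real engagement would load this from the agreed scope document.
IN_SCOPE_DOMAINS = {"example.com"}


def host_of(target: str) -> str:
    """Return the lowercase hostname of a URL or bare host string."""
    parsed = urlparse(target if "://" in target else f"//{target}")
    return (parsed.hostname or "").lower()


def is_in_scope(target: str, allowed=IN_SCOPE_DOMAINS) -> bool:
    """True only if the host equals an allowed domain or is a subdomain of one."""
    host = host_of(target)
    return any(host == d or host.endswith("." + d) for d in allowed)


def split_findings(targets):
    """Partition agent-discovered targets into (in_scope, needs_human_review)."""
    in_scope = [t for t in targets if is_in_scope(t)]
    review = [t for t in targets if not is_in_scope(t)]
    return in_scope, review
```

A personal GitHub repository would land in the review list here because its host is `github.com`, no matter how closely the path resembles a staff member's name.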

Derek Banks

Also, I'd be interested to see, it looks like the models that they were using were GPT-4o, Gemini Flash, and o1. Well, I mean, those are a little dated at this point. I'm not saying it invalidates the research at all. I think the percentages now would be higher than 21% and 64%, using, even, you know, like, right now, some open-weight models that I've been testing with that have been specifically fine-tuned to do agentic, quote, long-horizon tasks. I mean, they're quite good.

Improving AI Agent Workflows

They're not... I think I still would say that, you know, fully autonomous isn't ready for prime time. We're not gonna have the, you know, the sci-fi, what is it, Neuromancer, everybody gets an AI that hacks kind of thing. Right? Not yet. Yeah. But it's probably coming. So, yep.

Brian Fehrman

Yeah. I'm looking at this. So this paper actually is, I guess, a little bit dated, a little bit older. So this is...

Derek Banks

Well, you know how academic papers work. Right? Like, do the research.

Brian Fehrman

It can take forever.

Derek Banks

And so it takes a hot minute sometimes. So they probably did this last year. Right?

Brian Fehrman

Yep. Yeah. It looks like a yep. Somewhere around there.

Derek Banks

Which also probably explains my earlier comment of why did you just give it Kali? Now they might take a different approach. After, you know, the Claude Code wave and the agentic coding wave, it would seem like a better choice to kinda go down that kind of route now. But this time last year, that's not the route I was going down.

Model Limitations and Updates

Brian Fehrman

Yep. Yeah. And so another thing that's noted is that they had to use a role-playing jailbreak to get the LLM to complete the tasks, because the model's own safety filters were blocking it, which is to be expected. But, as we were discussing earlier, oftentimes that just means telling the model that you're authorized to do something. And then if it says no, then you just, like, tell it that, no. No. Really. Really. I'm authorized, or this is hypothetical, or whatever.

Derek Banks

Well, that's what the Chinese were doing with the Anthropic models. And when they had their hacking operation uncovered, you know, Anthropic was saying that's basically what they were doing: essentially lying to the model in small things like, yeah, we're authorized to do this. I mean, you know, the way LLMs work, it doesn't know. Right?

And so I think that's interesting, but I will say that, I mean, I use frontier AI models to hack, you know, all the time. And it's pretty much as easy as saying, I'm on an authorized pen test. Like, if you're using a chatbot. If you control the system prompt and are using an AI, yeah, I don't even really have to do that. Right? My system prompt basically says, you know, we're hacking.

Let's do it, kind of thing. And so I haven't run into an internal guardrail, but I know some of our other researchers have, doing, like, vulnerability research in the nuts and bolts of, like, Windows. They've gotten some internal guardrail stuff. But, yep.

Jailbreaks and Model Guardrails

Brian Fehrman

Yeah. But, usually, it's just a matter of time of trying to get around it. I mean, the providers are coming out with their supposedly vetted security vendor offerings, where they vet you and then allow you access to models that have fewer guardrails in place, but we still don't know. I mean, is that a matter of whether they remove certain system prompts, or put in certain system prompt instructions? Have they fine-tuned the model to try to internally remove them?

Are they doing their own ablation process? We don't really know at this point what that actually means, and does it make a difference?

Derek Banks

I think we get a different system prompt when we're using their apps. I mean, that would be the easiest thing. Right?

Brian Fehrman

Yeah. Yep.

Derek Banks

I can't imagine that they have, like, a, well, now you get access to this other, you know, Opus 4.6. I mean, it's possible for sure.

Brian Fehrman

Yeah. But from, like, a financial and business perspective, how much would it cost to actually make those changes?

Derek Banks

And from, like, a routing-your-traffic-versus-someone-else's perspective, like, yeah, I just don't know. But, I mean, just so far in 2026, I haven't had a whole lot of pushback from any model doing any kind of security work. I think maybe I've just gotten used to, like, how I phrase my input. And then also maybe it helps using, like, an ongoing persistent agent with memory that remembers things for me. I mean, that probably helps too.

Brian Fehrman

Yeah. The whole, it agreed to something earlier, so it'll continue agreeing to it.

Derek Banks

It knows who I am, what I do. I have a personal context portfolio and a Telos file like Daniel Miessler, and, you know, when I start sessions, I think all that stuff gets sent to Anthropic. And so I will say that it probably also means that, you know, at least in the case of, like, Anthropic or OpenAI, when your queries come in and they flag something, because let's face it, they're probably doing classification on your input.

I would.

Brian Fehrman

Oh. Yeah.

Derek Banks

I mean, yep. How else do you catch the Chinese? Right? That doesn't mean you store it. That means they're just, you know, essentially running, like, an IDS kind of thing. If they flag my account and see, oh, this is part of Black Hills Information Security, we know who they are, they're not, you know, doing things against the law, they might let that slide where another account might get banned. Right? So that's probably it, I think.

But they actually did say to us that they lowered our guardrails. I'm like, oh, well, thank you. No. Thanks. I appreciate that.

Brian Fehrman

Thanks, Robert.

Derek Banks

Yeah. Yeah. So the other thing that was in here that I thought was kind of neat was and and probably something that we all knew, but maybe we need to hear is do you think that this kinda lowers the bar? Yeah. Lowers the bar for people being able to do things like hacking or coding or whatever?

Provider Controls and Trust Factors

And so someone who might not have had the skills to run a global ransomware campaign can probably do that now.

Brian Fehrman

Yeah. Yeah. It's a lot easier for them to put together all the pieces that they need to do for that rather than coming up with it from scratch. Because, I mean, coming up with that from scratch, that's I mean, how do you even start searching for that? Google, how do I create a global, you know, ransomware attack?

Derek Banks

It's like in Office Space, like, how do I launder money? Like, I don't know.

Brian Fehrman

Yeah. Exactly. But now you've got a source that'll kinda consolidate it for you if you ask it nicely in the right way.

Derek Banks

Yep. So I think the last part, the so what? So 64% on real CVEs. I mean, I keep saying this: if you thought you had to keep things patched and up to date in the past, now more so than ever, especially externally. I think, you know, running things on the edge of your network is something that I would apply very much increased scrutiny to. If it were me and you asked me for my opinion on those kinds of things, I would say I would even do, you know, packet capture and network traffic analysis on the outside of your network. That is how the Palo Alto compromise...

Lower Barrier for Cyber Attacks

Was it last year or was it the year before? Oh, the years run together now. That's how it was detected: running Zeek essentially on the outside and seeing that a Palo Alto firewall was running curl out to somewhere on the Internet and bringing back said thing, and that's typically not what you want your firewall doing, I wouldn't think. No. And so, you know, it's not perfect, but at least, you know, defense in depth kind of thing.

And then also, if you're a red team or if you're on the offensive side and you're not using AI to make your life better and easier and more efficient, you're missing out.
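The edge-monitoring idea Derek describes, a sensor outside the perimeter noticing a firewall fetching something with curl, can be approximated with a small filter over Zeek's `http.log`. This is only an illustrative sketch and not the detection used in the incident he mentions. It assumes Zeek is writing JSON logs (one object per line, with the standard `id.orig_h` and `user_agent` fields), and `EDGE_DEVICE_IPS` is a hypothetical list of devices that should never act as web clients.

```python
import json

# Hypothetical addresses of edge devices that should never initiate web requests.
EDGE_DEVICE_IPS = {"203.0.113.10"}


def suspicious_edge_requests(lines, device_ips=EDGE_DEVICE_IPS):
    """Yield http.log records where an edge device is the HTTP client and uses curl."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            rec = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip malformed lines rather than abort the scan
        if not isinstance(rec, dict):
            continue
        if rec.get("id.orig_h") not in device_ips:
            continue
        agent = rec.get("user_agent") or ""
        if agent.lower().startswith("curl/"):
            yield rec
```

In practice this could be fed an open log file, for example `suspicious_edge_requests(open("http.log"))`. It only sees cleartext HTTP, so TLS traffic would need a similar check against other Zeek logs.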

Brian Fehrman

Oh, yeah. Absolutely. It's a necessity at this point. It's going to become the expectation, and if you're not utilizing it, you're certainly gonna fall behind, without a doubt. It's already happening.

Defensive Security Implications

Derek Banks

Yeah. And if you're in the unique position like Brian and I where you're trying to build an agentic AI penetration testing platform, it's certainly a lot harder than it sounds. It sounds easy because, you know, you see on Twitter, everybody and their brother has some, you know, agentic AI penetration testing framework out. But I would say that, you know, if you're building such a thing, I would, early on, introduce some kind of benchmark, because it's hard to determine: is it not finding anything because this customer is really good, or is it not finding anything because this is broken?

Brian Fehrman

Yeah. Yeah. It's good to have some scientific results. Right? So you don't have moving targets and a lot of unknowns. That way, you can truly see, is this getting better, how can I improve it, and how well does it actually do? So I think benchmarks are absolutely essential. And it sounds like we've got, you know, the benchmark that we've spoken about here, and I'm sure we'll have plenty of others on the horizon that will come out.
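The point about benchmarking early is that a fixed set of targets with known answers separates "the customer is well defended" from "the agent is broken". A minimal sketch of that idea might look like the following. The `Task` fields, the flag-string convention, and `run_benchmark` are assumptions made for illustration, not AutoPenBench's actual interface or anything described in the episode.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class Task:
    name: str
    category: str       # e.g. "in-vitro" or "real-world"
    expected_flag: str  # known ground truth planted in the lab target


def run_benchmark(tasks: Iterable[Task], agent: Callable[[Task], str]) -> dict:
    """Run the agent on every task and return success rates overall and per category."""
    totals, wins = {}, {}
    for task in tasks:
        try:
            solved = agent(task) == task.expected_flag
        except Exception:
            solved = False  # a crashed run counts as a failure, not a skipped task
        for key in ("overall", task.category):
            totals[key] = totals.get(key, 0) + 1
            wins[key] = wins.get(key, 0) + int(solved)
    return {key: wins[key] / totals[key] for key in totals}
```

Rerunning the same task list after each prompt or tooling change gives a stable number to compare, which is the "no moving targets" property Brian mentions.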

Derek Banks

Alright. So I guess we'll wrap it up and say fully autonomous AI hacking is still mediocre. Human-in-the-loop AI hacking is already good and getting better. Like we said, if it was already good with o1 and GPT-4o, well, those are so last year.

Why Red Teams Need AI Now

Brian Fehrman

So yesterday.

Derek Banks

And so, yes, stuff's moving fast, so try and keep up.

Brian Fehrman

With that, I hope you enjoyed, and keep on prompting.

Transcript source: Provided by creator in RSS feed