Is AI ready for DevOps?

⁠¶ Intro

Bret

00:10

Welcome to the first episode of my new podcast, Agentic DevOps. This episode is kicking off what I think is going to be a big topic for my entire year, and probably for the next few years: wrangling AI into some usable format for DevOps. You've probably heard of AI agents by now, or the MCP protocol. I guess I should just say MCP, since the P stands for protocol. And these two things together are creating potentially something very useful for platform engineering, DevOps, and that stuff.

00:47

It has so much potential that in the first quarter of 2025, I kind of thought this was gonna be a big deal. If we can figure out how to keep these things from hallucinating and going crazy in our infrastructure, this could potentially be the AI shift for infrastructure that I was waiting for. So I started this podcast. We recorded our first episode at KubeCon at the beginning of April 2025, and this is gonna be a series of very specific episodes around getting

01:21

AIs to do useful automation and work for DevOps, platform engineering, infrastructure management, cloud, you know, all those things beyond just writing YAML, right? So the intro for this podcast is a separate episode. It kind of goes into my whole theory of why I think this is gonna be a thing. In this episode, we really try to break down the basics and fundamentals for those of you that are catching up, because it's a lot. There's a lot going on.

01:47

It seems like we have announcements every day this year around AI agents or agentic AI, whatever you wanna call it. I am calling it Agentic DevOps, and hoping that name will stick. Now, this episode is from the beginning of April, and it's technically just getting released at the beginning of June. We're a little bit behind on launching this new podcast. I think everything in it is still relevant. There's just been a lot more since. And I don't know the frequency yet.

02:15

I don't know how often this podcast is gonna happen. It could be potentially every other week. It could be weekly. I just don't know yet, because we are not gonna do the same thing here as on my usual podcast. If you're someone who knows that one, DevOps and Docker Talk, which I've been doing the last seven years, that one is still gonna have AI in it.

02:31

But this one is very specific, and there might be a few episodes that are syndicated, or whatever you wanna call it, on both podcasts. But most of the time we're gonna keep the focus of everything DevOps, everything containers, on the DevOps and Docker Talk show. And this one is gonna be very specific around implementing useful AI-related things for Agentic DevOps, or automating our DevOps with robots. So I hope you enjoy this episode with Nirmal from KubeCon London.

03:10

Hey, I'm Bret. And we're at KubeCon. Hi, Nirmal.

Nirmal

03:15

I'm Nirmal Mehta. I'm a principal specialist solutions architect at AWS, and these are my views and not my employer's. But this episode is all about

Bret

03:24

Nirmal

03:25

agents

Bret

03:26

for DevOps and platform engineering. Ooh. So let's just start off real quick with: what is an AI agent? Okay. So we've heard of AI, we know AI, gen AI, ChatGPT. We've talked about running LLMs, running inference on platforms. Yep. And that we are managing the workloads that provide other people services. Absolutely. So how are AI agents different than that?

Nirmal

03:51

This is a air in terms of bleeding edge. Yeah. This is it, right? Yeah. Like we're a year ago. No one

Bret

03:57

had this

Nirmal

03:57

term

Bret

03:58

six months ago. I don't think anybody's

Nirmal

03:59

talking about it. Very few people. Yeah, very few people. And we've seen it in the news, a lot of vendors and big companies announcing agentic AI. That's another term, so AI agents, agentic AI. It's giving your LLM, like your ChatGPT or your Claude or a local LLM like Llama, access to run commands on your behalf. Or on its behalf.

Bret

04:27

Yeah. And we call those tools, if you hear that word. Tools. Yeah. That's like the generic term. I guess a shell could be a tool. Correct. Reading a file could be a tool. Accessing a remote API of a web service is a tool. Yep. Searching could be a tool. And so these tools, what makes that different than what we've been seeing in our code editors?

⁠¶ Understanding AI Agents

04:50

Yeah. How is that different?

Nirmal

04:51

I'm a platform engineer and I want to build out an EKS cluster using Terraform. That's what we use. So I'll ask, let's say, Claude or ChatGPT: I'm a platform engineer and I want to build a production-ready EKS cluster. Please create the assets I need. And it will spit out some Terraform, YAML, right? Yeah.

Bret

05:12

And it's writing text.

Nirmal

05:13

It's writing text. And there'll be a little button. I copy that, put it in, or if you're using Cursor or all these other tools, you can put it into some .tf file. Yeah. I can then take that and ask the LLM: what's the command that I need to run to apply this Terraform, to actually stand up what's described in this Terraform? It'll spit out, okay, you wanna do terraform plan and then terraform apply and all that.

05:38

terraform init or whatever, and I'll just copy those commands and check 'em and run them myself. So the LLM is not executing anything on my behalf. On your behalf. An agent would be defining a tool set. So I could define a tool called Terraform or a tool called Shell, and I could describe what that tool does in natural language.
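To make the "tool with a natural-language description" idea concrete, here is a minimal sketch in Python. The `Tool` class, the `TOOLS` list, and `render_tool_prompt` are hypothetical names invented for illustration; they do not come from any agent framework, and a real agent would pass these descriptions to the model in whatever format its API expects.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tool:
    """A capability the agent may use, described in plain language."""
    name: str
    description: str


# Hypothetical tool set; the names and wording are illustrative only.
TOOLS = [
    Tool("terraform", "Use infrastructure as code to plan and apply cloud resources."),
    Tool("shell", "Run a shell command and return its output."),
]


def render_tool_prompt(tools: list[Tool]) -> str:
    """Turn the tool list into text that could be placed in the LLM's prompt."""
    lines = ["You may use the following tools:"]
    lines += [f"- {tool.name}: {tool.description}" for tool in tools]
    return "\n".join(lines)


print(render_tool_prompt(TOOLS))
```

The point of the sketch is only that the model never sees the tool's code, just its name and a description it can reason about.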

Bret

06:05

Okay.

Nirmal

06:05

And then I can give the LLM system a list of these tools and their descriptions, and tell it, okay, back to the same scenario: I'm a platform engineer and I want to create an EKS production cluster using Terraform, and I want you to create it for me. Because it has access to those tools, now it internally reasons: okay, I need to create some Terraform. I need to validate it in some kind of way, and then I need to execute this Terraform. Are there any tools that I have in my toolbox

Bret

06:40

In this case, sorry, the "I" is, you're referring to yourself as the AI, right? Yeah. Sorry. It's no longer the human doing this, right? No. We gave it instructions and we sit back

Nirmal

06:48

From the perspective of the LLM, the gen AI tool itself, the LLM system, that's the "I" in this scenario. Yeah. I, the LLM, am deciding. The gen AI tool is looking at its list of available tools and matching what it needs to it. It's reasoning about what the end goal is, and it looks and says, there's this tool called Terraform that allows me to use infrastructure as code to deploy resources on the cloud. That sounds like what I need. Maybe. And it

07:24

generates the Terraform just like it did the first time around. It knows what command to run. It generates the command, and then the magic here: a little box will show up and say, do you want me to execute this on your behalf? You click the button, and then it executes that terraform apply. Uh-huh. And it sounds very simple, but it's a very different paradigm in terms of thinking about how we interact with infrastructure, or systems in general. Like, broadly, systems in general.

08:00

Because in this way of looking at it or thinking about it, I, as the human, am no longer executing those commands. I am trusting, to a certain extent, that the LLM can figure out what it needs to do, and giving it a guardrailed set of tools to use and execute.
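The "little box" approval step and the guardrailed tool set can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not how any particular agent product works: `ALLOWED_TOOLS`, `run_with_approval`, and the `approve` and `runner` callbacks are all hypothetical names. The runner is injectable so the flow can be exercised without actually invoking Terraform.

```python
import shlex
import subprocess
from typing import Callable

# Hypothetical guardrail: the only executables the agent is allowed to run.
ALLOWED_TOOLS = {"terraform"}


def run_with_approval(
    proposed_command: str,
    approve: Callable[[str], bool],
    runner: Callable[[list[str]], int] = lambda argv: subprocess.run(argv).returncode,
) -> str:
    """Run a command proposed by the LLM only if it passes both gates."""
    argv = shlex.split(proposed_command)
    # Gate 1: the command must start with a tool we explicitly allowed.
    if not argv or argv[0] not in ALLOWED_TOOLS:
        return "rejected: tool not in allowlist"
    # Gate 2: a human must click the button (here, a callback returning True).
    if not approve(proposed_command):
        return "declined by human"
    exit_code = runner(argv)
    return f"executed, exit code {exit_code}"
```

In an interactive session, `approve` could be something like `lambda cmd: input(f"Run '{cmd}'? [y/N] ").lower() == "y"`. The design choice worth noticing is that the model only proposes text; the surrounding agent code decides whether anything actually runs.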

Bret

08:23

Yeah. And so we're giving the Chaos Monkey access. I mean, it's automation, right? We could actually classify this as just automation. It just happens to be figuring out what to automate in real time, rather than the traditional automation where we have a very deterministic plan of steps that are repeated over and over again by a GitHub Actions runner or a CI/CD platform or something. Yeah.

Nirmal

08:43

And the agent part is the piece of software that enables the LLM to execute.

Bret

08:51

Yeah.

Nirmal

08:52

and pulls this all together. So back to what I was talking about with the infrastructure: there was a part where I said, okay, how do we define what tools are available for the agent system to use? And how do I want the agent to call those tools and reason about them? There's a protocol called MCP, Model Context Protocol, just outlining a standard way of defining the tools, the system prompt for that tool, and a description.
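As a rough illustration of what "a standard way of defining the tools" looks like, here is a sketch in Python of a tool descriptor and a call request in the general shape MCP uses: as far as I understand the spec, tools are listed with a `name`, a `description`, and a JSON Schema `inputSchema`, and invoked through a JSON-RPC 2.0 `tools/call` request. The `terraform_plan` tool, its `working_dir` argument, and the `build_tool_call` helper are hypothetical and exist only for this example.

```python
import json

# Hypothetical tool descriptor, in the shape an MCP server lists its tools.
terraform_plan_tool = {
    "name": "terraform_plan",
    "description": "Run `terraform plan` in a working directory and return the plan output.",
    "inputSchema": {
        "type": "object",
        "properties": {"working_dir": {"type": "string"}},
        "required": ["working_dir"],
    },
}


def build_tool_call(request_id: int, tool: dict, arguments: dict) -> dict:
    """Build a JSON-RPC 2.0 request asking a server to invoke a tool."""
    # Minimal check: every argument the schema marks as required must be present.
    missing = [key for key in tool["inputSchema"].get("required", []) if key not in arguments]
    if missing:
        raise ValueError(f"missing required arguments: {missing}")
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool["name"], "arguments": arguments},
    }


print(json.dumps(build_tool_call(1, terraform_plan_tool, {"working_dir": "./eks"}), indent=2))
```

The value of a shared shape like this is that any client that speaks the protocol can discover and call any server's tools without custom glue per tool.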

Bret

09:25

And this is like an API, where you define the spec of an API.

Nirmal

09:27

It's a defined spec of an API and the adoption of that API is

Bret

09:33

just exploding right now,

Nirmal

09:34

essentially.

Bret

09:34

Yeah. So, if you're not, okay, sorry, lemme back up a second. That's a very valid point, because that's the reason I wanted to record this. I don't wanna be a hype machine. Correct. But I'm super excited right now. If you could see inside my enthusiastic brain: I've only been paying attention to this for a little over a month. If you asked me two months ago what an AI agent was, I'd say, I don't know, a robot that's AI? I don't know.

09:59

I now think I've got a much better handle on this. I've been spending so much of my life right now deep diving into this, to the point that you and I are talking about changing some of the focus this year on all these topics. Absolutely. Because I think this is gonna dominate the conversation. There's gonna be a lot of predictions in this, and we're not gonna talk forever, 'cause it's gonna need to be multiple episodes to really break down what's going on here.

10:20

But we now have the definitions: AI agents, and what tools are. The protocol behind it is essentially MCP right now, although that's not necessarily gonna be the only thing. It's just the thing right now that we're agreeing on, by one company. Exactly.

Nirmal

10:33

We have to caveat this with: this is early, like early Docker days. This is like

Bret

10:40

Docker at day 60, right? Yes. Like we were right after PyCon in 2013 when he gave that demo, Solomon. We all saw it and didn't understand it fully, but it felt like something, right? And that's why you and I both were early Docker Captains: we saw that as a platform shift. We've seen these waves before over our careers of many decades, where we earned this graybeard status with effort and toil. And I feel like this is maybe the moment.

11:13

That was the moment of 2013, and yeah, I'm not alone in that feeling. Yes.

Nirmal

11:18

And just to be clear, there are massive differences between paradigm shifts in terms of virtualization, cloud, containers, and the tooling of software development and systems development

⁠¶ The Future of AI in DevOps

11:31

and systems operations. It's still in that same vein, but, yeah, we're not replacing,

Bret

11:36

this is not replacing infrastructure or containers or anything like that. This is just gonna change the way we work.

Nirmal

11:41

Correct. And also it's broader than just IT infrastructure. This has implications for software design, or what an application does. And I want to think of this as a teaser trailer for a subsequent new series. A new series. Yeah, absolutely. We're gonna have to

Bret

11:57

come up with a name. I'm toying around with the idea of Agentic DevOps, and just classifying that, absolutely, as the theme of certain levels of podcast episodes. You've heard it here first. Heard it here first. This

Nirmal

12:07

is Agentic DevOps. Another term we're seeing is AI for Ops. Again, this is early days. None of this is like

Bret

12:13

Yeah. Set in stone at all. Yeah, and if you're at KubeCon today with us, if you were here at this conference all week, AI was a constant topic, but it wasn't about this. There was only one talk in the entire week that even touched on the idea of using AI to do the job of a DevOps engineer or operator or platform engineer. What we've been talking about at KubeCon for the last three years has been how to run the inference and build the LLM models.

12:40

And so we are still just using human effort to do that work. But I feel like I'm gonna draw the line in the sand and say this is the month, or definitely the year, that kicks off what will be a multi-year effort of figuring out how we use automated LLMs, essentially, with access to all the tools we want to give them, with the proper permissions and only the permissions we want to give them, to do our work for us. In a less Chaos Monkey way, right? Like a less chaotic way.

13:13

Potentially. Potentially. This thing can easily go off the rails. Absolutely. I will probably reference in the show notes Solomon Hykes's recent talks about how they're now using Dagger, which is primarily a CI/CD pipeline tool. A lot of my language is actually from him iterating on his idea of what this might look like when we're throwing a bunch of crazy hallucinating AI into what we consider a deterministic world. Correct.

Nirmal

13:40

I think with containers and cloud and the infrastructure APIs we have, we were chipping away and really aiming at deterministic behavior with respect to infrastructure. Ironically, maybe not ironically, I don't know, now we're introducing a paradigm shift that reintroduces a lot of non-determinism into a place where we have been fighting non-determinism for a long time.

Bret

14:10

We have been working to get rid of all that. And that's why I keep saying Chaos Monkey, because we're throwing a wrench into the system. In some ways it feels like we're going back to a world of, I don't know, what's the status of the system? I don't know. And this will probably be another episode: I feel like with this agentic approach we have the potential to pit the LLMs against each other, right? And have different personas of these agents.

14:34

One is the validator, one is the tester, one is the builder. And they can fight amongst each other, and it all works out. It actually happens to work out better. And so if you're like me, for the last three years, ever since GPT-3.5 or whatever came out.

14:52

We all saw ChatGPT as a product, and then we started with GitHub Copilot and we started down this road. As a DevOps person, I haven't had a lot to talk about, because I'm not interested in which model is the fastest or the most accurate. 'Cause you know what? They all hallucinate, still, even today, years later. Code agents, and you can see this on YouTube, you can watch basically thousands of videos of people trying to use these models to write perfect code, and they just don't.

15:21

And so we in ops look at that, I think, and the people I've talked to for years now are like, we're never gonna use that for ops. But now my opinion has changed. Yeah.

Nirmal

15:32

Yeah. And if you're listening to this and your gut reaction is, wait, we have APIs that are deterministic. Like you just

Bret

15:40

Yeah.

Nirmal

15:40

We can just call an API. We can have an automation tool call an API to stand up infrastructure, and why do we need to create another layer that makes it non-deterministic, and looks like an API but isn't an API, and you don't really know what it might do or which direction it might go? Yeah. And you're feeling, I don't know, that doesn't seem like it would solve any problems for me, and it seems like it might introduce a lot of problems.

16:06

You're in the right place because that's exactly what we're gonna explore.

Bret

16:09

Yeah.

Nirmal

16:09

One thing for sure, though, is it's here, right? And so I feel like as good engineers, as good system admins and operators

Bret

16:20

we enjoy, we love our craft. We look at this as an art form of brain power, right? Reaching for perfectionism in our YAML and in our infrastructure optimization and our security.

Nirmal

16:32

And we have a healthy sense of skepticism on new tools, new processes, new mechanisms. Yeah. When availability of your services is paramount, and reliability, you want to introduce new things in a prudent manner. And so we're gonna take that approach, but we're not going to dismiss that this exists. Clearly there's a lot of interest, energy, integration happening, experimentation happening, and some people are already starting to see value.

17:05

Yeah. And we're gonna explore with you where that goes.

Bret

Yeah. Just to be clear, this is KubeCon, April 2025, and almost no one is talking about this yet. It feels like it's right under the surface of a lot of conversations, and a lot of people maybe are thinking about it, but I'm not even sure that we're honest with ourselves around the fact that this is coming, whether we like it or not. And not only, but one of the large reasons is business. Okay. Lemme back up.

17:40

You know how in a lot of organizations, Kubernetes became a mandate, right? There are lots of stories that came out over the course of Kubernetes' lifetime of teams being told that they need to implement Kubernetes. It didn't come from a systems engineering approach of solving a known problem. It came down because an executive read a CIO magazine article that said Kubernetes was a cool new thing, and they did it, right? I hear this all the time.

18:06

I confirmed this multiple times this week with other people. And we're not talking about it yet, but I did hear multiple analysts say the organizations that they're working with expect that we are going to reduce the number of personnel in infrastructure because of AI. The only way that's possible is if we use agents to our advantage. I still don't believe we're replacing ourselves. I don't think the agents will ever, in the near term.

18:38

And as far as we can see out, let's say five years, they won't be running all infrastructure in the world by themselves. They can't turn on servers.

⁠¶ Concluding Thoughts and Future Episodes

18:47

They, maybe you can actually PXE boot and do a power on over PoE or whatever, but we still need someone to give them orders and rules and guidelines to go do the work. But to me, I'm starting to wonder if very quickly, especially for those bleeding-edge organizations that are looking to squeeze out every cost optimization they can of their staff, they're going to be mandated to not just take AI as code gen for YAML, but to start using these agents to

19:16

increase the velocity of their work. And one of my stories, I do this in talks, is that over the last 30 years every major shift has been about cost reduction and speed. Sometimes we get 'em both at the same time. Sometimes it's one or the other. We get a cost reduction, but we don't go any faster, which is fine, or we're going faster, but it's not necessarily cheaper yet. Right.

Bret

19:36

And I feel like this is maybe the next one, where we're gonna be feeling the pressure, because all the devs are gonna be writing code with AI, which in theory is going to improve their performance, which means they're writing more code, shipping more, or wanting to ship more code, potentially. And if we're not using AI ourselves

19:56

to automate more of these platform designs, platform build-outs, troubleshooting when we're in production and things are problematic and we don't wanna spend three hours trying to find the source of the problem, if we're not starting to use agents to automate a lot of that and reduce the time to market, so to speak, for a certain feature or platform feature, then I don't think these teams are gonna hire more of us to help enable the devs to deploy.

20:22

What could end up happening is we end up with more shadow ops, where the developers are so fed up with us not speeding up. If they're gonna go 10x, we have to go 10x. Yeah. If they're gonna go 3x, or whatever the number ends up being in the reports, and Gartner puts out that AI makes it more efficient for developers to code, and the models get better, and the way they use it is better.

20:43

And so they're shipping code faster, and they can do the same speed with a third as many developers, or they can just produce three times more work, which I think is more likely. Because if it's the common denominator and everyone has it, then every company can execute faster, and they're gonna want to do that because their competitors are doing that. So that's a very loaded and long prediction.

Nirmal

21:03

That's a hypothesis. I think there are a lot of predictions here. It's gonna take some time for us to even chip away at that hypothesis, but it's a good starting point. Assuming that is the hypothesis organizations are looking at to adopt these tools, that's a great starting point for us to help you figure out what they are, why they are, what they do. Yeah. And how to use them.

Bret

21:27

By the way, a little bit of that opinion of mine, and there's more to come, 'cause I've got a lot more written down that we're never gonna get to, but a significant portion of that is actually coming from what I've learned this week from analysts whose job it is to figure this stuff out for their organizations and their customers. Interesting. And so I am a little weighted by their almost unrealistic expectations of how fast we can do this. 'Cause we are still humans.

21:55

An organization can't adopt AI until the humans learn how to adopt AI and the humans have to go at human speed. So we can't just flip a switch and suddenly AI is here and running everything for us. At least not until we have Iron Man's Jarvis. Or whatever. Like until we have that, we still have to learn these tools and still have to adapt our platforms to use them. Yes. And adapt our learning to use them. And that's gonna take some time

Nirmal

22:16

And the parting thought for this is, okay, like you said, there's an under-the-surface kind of thing happening. Yeah. So whispers,

Bret

22:26

it's almost like murmurs and under

Nirmal

22:28

the surface. Yeah. AI agent, AI agents, agentic

Bret

22:32

DevOps. Ooh. This is our ASMR moment of the podcast.

Nirmal

22:37

Like MCP protocol.

Bret

22:38

Yeah.

Nirmal

22:39

You mentioned HAProxy on the previous podcast, about load balancing and figuring out the token utilization of GPUs and tokens and all that stuff. And we had a conversation at the Solo booth, and they were talking about having a proxy for an MCP gateway. One of the things that we're seeing the early signs of is these new workloads, right?

23:01

This agentic kind of thinking around even just executing the agentic platform, if you will. And everything from looking at the tokens and optimizing load balancing to inference endpoints, or MCP, which doesn't behave the same way as just an HTTP connection, necessarily. And Solo, we were talking to them, and they have an MCP gateway. We're seeing a little bit more of a trend on AI gateways.

23:28

Istio, the project, has an AI gateway. And so this is not just another workload that looks like just a web server. The networking and everything is gonna be different. Not dramatically different, but different enough that we need to be aware. 'Cause even if you're not using any of these tools, someone in your organization is probably gonna say, oh, we need to integrate this stuff into our software, right? Into whatever we're delivering.

23:54

And we'll need to know it even at that layer. So we're gonna also cover that component as it relates to the Kubernetes ecosystem, right? And cloud native.
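One way to picture why inference traffic is not "just another web server" is token-aware load balancing: requests differ hugely in cost, so counting connections is a poor signal. The Python sketch below is a toy illustration only, not how HAProxy, Solo's gateway, or any other product mentioned here actually works; `Endpoint`, `pick_endpoint`, and `dispatch` are hypothetical names, and the token estimate is assumed to come from somewhere upstream.

```python
from dataclasses import dataclass


@dataclass
class Endpoint:
    """An inference backend and the tokens it is currently working through."""
    name: str
    in_flight_tokens: int = 0


def pick_endpoint(endpoints: list[Endpoint]) -> Endpoint:
    """Choose the backend with the fewest in-flight tokens (first one wins ties)."""
    return min(endpoints, key=lambda endpoint: endpoint.in_flight_tokens)


def dispatch(endpoints: list[Endpoint], estimated_tokens: int) -> Endpoint:
    """Route a request and record its estimated token cost on the chosen backend."""
    target = pick_endpoint(endpoints)
    target.in_flight_tokens += estimated_tokens
    return target
```

With two idle backends, one 1,000-token request followed by two 10-token requests would send the large request to the first backend and both small ones to the second, whereas plain round-robin would stack a small request behind the large one.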

Bret

24:02

Yeah. I think if we had to do an elevator pitch for this podcast, it would be: we now have an industry idea around these terms. There's the agent, and it uses an API called MCP to allow us to give more work to these crazy robot texting things that we have to talk to in human language and not with code, right? It's running code, but we're not talking to it with code. And it can now understand all the tools we need to use, and we can just give it a list of everything I want it to use.

24:34

Here's my Kubernetes API, here's all my other things that you have access to, and here's my problem. Go solve it. And that paradigm, three months ago, two months ago, I didn't know existed. And that's why I've been sitting on the sidelines with AI. Like, it's cool for writing programs that mostly work in a demo. It's cool for adding a feature to something I already have, but it's not doing my job as a platform engineer or DevOps engineer.

25:03

It's just helping me write text faster than I can type into my keyboard. And that was not that interesting. That's why you didn't see a lot of me talking about that on this show: it just wasn't that interesting. This is an interesting topic for ops and, absolutely, for engineers on the platform.

Nirmal

25:17

Yep.

Bret

25:18

Nirmal

25:19

Stay tuned. Yeah. And I love crazy texting robots. Crazy

Bret

25:24

texting robots. Maybe that's the title. TBD. Alright. Alright. See you soon, man. See

Nirmal

25:32

you. See you. Bye. Bye.

Transcript source: Provided by creator in RSS feed
