The Role of AI in DevOps: Observability, Security, and Efficiency - DevOps 194 | Adventures in DevOps podcast

00:14

What's going on, everyone? Welcome to another episode of Adventures in DevOps. I'm your host for today, Will Button. Joining me in the studio is my co-host, back for, what is this, episode number three, Warren? Let's call it three. We'll call it three. Yeah, right on. I'm excited. It's starting to look like a trend here. Yeah. You know,

00:36

I'm Warren, CTO at Authress. Thanks for inviting me on. I know I promised just one or two episodes, so, you know, I think this trend is going in a good direction, and I hopefully will be on future ones as well. Cool. I hope so, because I'm enjoying having you here on the show and joining us. Also in the studio today we have the DevOps activist from Dynatrace, the Cloud Native Computing Foundation

01:07

ambassador and host of the Pure Performance podcast. And I can validate this because, if you're not familiar with the podcasting industry, there is actually a secret handshake used by all podcast hosts around the world. And I did validate this, so I can confirm that this is actually the legit, non-ChatGPT Andy Grabner joining us on the show. Welcome, Andy. Thank you so much for having me. I know that AI technology is pretty good already, but I'm

01:40

not sure if they can simulate my accent. Yeah, right. And if you're joining us on the livestream, you can actually see Andy as well as Warren and myself, and we can do the test as well to show that it's a real, legitimate background and not just an artificial background behind us. My leads move here. Yeah, I can walk behind my kitchen counter, but I need to take off my headset quickly and I'll show you. Right, yeah, same thing. There you go. Perfect. Successful test, can

02:20

confirm that we are not robots. Cool. So welcome to the show. So you've been doing the Pure Performance podcast for seven years, is that right? Yeah, I think it's about seven years, maybe even close to eight years. Last week, I believe, we had the two hundred and first episode. Oh nice, that you have released. Yeah, and it's been exciting, and if you look back, it's

02:49

actually quite astonishing that we've been doing this for that long. And with "we", I also mean Brian Wilson, my co-host, and myself. So Brian is not here today, but yeah, thanks, Brian. Right, so for anyone who's not familiar with your podcast, give us the short summary of what you talk about on there. Yeah, so we initially started it because both Brian and I have been in the performance engineering space for twenty plus years now.

03:19

It's been a long, long time. We've both worked at Dynatrace, even though the podcast is not related to the product or the services we sell. It's basically based on our experience and what we see out there. So we started initially to talk about what the performance problems are that we see in modern-day applications. You know, why are applications crashing, why are applications

03:38

slow? What are the most common things that we see? How can we educate our listeners, which we initially thought of as mainly developers and performance testers? What can we teach them? But over the seven years things have changed, and our field of topics has changed and evolved, and we moved from performance engineering to DevOps, to cloud native, to Kubernetes, to platform engineering.

04:04

We covered a wide range of topics, and over the years we also had many interesting guests, a lot of practitioners, and also people that you see on the big stages in IT. So, yeah, it's been quite interesting, and I think it's a broad range of topics, and hopefully there's something in there for everyone. Right on. So you mentioned that your podcast audience initially was performance

04:34

specialists and practitioners, and you've seen that change over the years. Yeah, because we also changed, right? And with "we" I mean Brian and I, we have changed. And the reason why we changed is because, obviously, you know, day in and day out, we deal and we work with our community. That's the Dynatrace community, the observability community. And because our product

04:58

and capabilities have evolved, so have the people we interact with. They've changed, and we see nowadays more and more, you know, enterprise architects. We see cloud architects. We see people that are responsible for big Kubernetes clusters. We see people that are very heavily engaged in the CNCF community, like I am in the Cloud Native Computing Foundation community. So we're inviting a lot of my counterparts to the podcast and discussing topics that are relevant in the Cloud

05:28

Native ecosystem. And so yeah, that's why it has changed. But we always try to come back and kind of relate it to performance, because in the end, we all want to make sure that we help the global community to build performant and resilient systems. What was new eight years ago? Like, what was the topic of conversation, why was performance important? You know, what were you talking about back then? I think episode number one, two

05:58

and three, or at least the first handful, I believe, were all focusing on what the top performance problems are in Java and .NET applications, because these were kind of the two predominant technologies we were dealing with, right? And we still are seeing a lot of Java and .NET. And because Brian and I were doing a lot of load and performance testing and a lot of performance analysis on these types of applications, we said, let's discuss what are

06:25

the key things we find in Java and .NET applications that make applications fail, crash, or be slow, and talk about it and educate our listeners. So you have a secret love for the CLR, then? Yeah, I would... maybe the word love doesn't come to mind, but sure, yeah. But I mean, the interesting thing is, right, in the end, everything we discussed and all the patterns we found, let's

06:57

say most of them are not limited to a round time. Well, let's say you only see certain patterns in Java or dot net, because patterns are patterns, and they happen whether you are developing. We just did an episode recording that I think will come out next week on mainframe, right, so we also cover mainframe, and we see similar patterns that lead to problematic software in the mainframe versus Java, dot Net, and now going all the way

07:28

forward, things we see in the highly distributed world of Kubernetes. So, same patterns. I want to ask about that real quick, because one of the patterns I've seen throughout my career is a tendency to blame the stack for performance issues. Like a really common one is, oh, we wrote this in Java, but that's not performant, so we're moving to .NET, or, you know, insert any other technology you want in there. Have you

08:07

seen that? Yeah, I mean, the easiest thing for humans is to blame somebody else, right? And the reason why that is, I believe, is because we as an industry try to make it as easy as possible for engineers to write code. And easy means we are abstracting things away. We're abstracting the complexity away. And therefore, when you abstract things away, you allow people to build things very fast in a generic way,

08:39

but not optimized or optimizable for the specific use case. And so when people say this would be fifty times faster if I moved to random X or Y, then I would argue that this is only true if you actually change the implementation underneath, right? If you're replacing Hibernate with the ADO.NET Entity Framework, you will still run into the same problem because these are all very

09:07

generic frameworks. If you don't use them properly and optimize them for your use case, you will still end up with the N+1 query problem, fetching too much data that gets loaded into memory, that again gets filtered and parsed in memory, which leads to too much memory usage, high garbage collection, high CPU, bad performance. So my answer to this is, if you want to build really high performance applications, you need to understand what's actually happening in your

09:33

stack, right? You know, what type of data do you need, where do you get the data from, how do you get it efficiently, what type of data do you even need? Because otherwise you will end up with the same problem in any runtime. It doesn't matter. Yeah, the best case, even if it does fix whatever hypothetical problem you have at that moment, you're probably just creating a whole bunch of other ones that are not seen because of the switch, right? Yeah, yeah, I agree with you.
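
The N+1 query pattern mentioned above can be sketched in a few lines. This is a minimal illustration, not code from the episode: the `authors` and `posts` tables, the `CountingConnection` wrapper, and both lookup functions are hypothetical names, and an in-memory SQLite database stands in for whatever sits behind an ORM such as Hibernate or Entity Framework.

```python
import sqlite3


def build_db():
    """Create a small in-memory database: 5 authors with 3 posts each (made-up data)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE posts (id INTEGER PRIMARY KEY, author_id INTEGER, title TEXT);
        """
    )
    conn.executemany(
        "INSERT INTO authors VALUES (?, ?)",
        [(i, f"author-{i}") for i in range(1, 6)],
    )
    conn.executemany(
        "INSERT INTO posts (author_id, title) VALUES (?, ?)",
        [(a, f"post-{a}-{n}") for a in range(1, 6) for n in range(3)],
    )
    return conn


class CountingConnection:
    """Hypothetical wrapper that counts how many statements are sent to the database."""

    def __init__(self, conn):
        self.conn = conn
        self.queries = 0

    def execute(self, sql, params=()):
        self.queries += 1
        return self.conn.execute(sql, params)


def titles_n_plus_one(db):
    """One query for the author list, then one more query per author."""
    result = {}
    for author_id, name in db.execute("SELECT id, name FROM authors").fetchall():
        rows = db.execute(
            "SELECT title FROM posts WHERE author_id = ? ORDER BY id", (author_id,)
        ).fetchall()
        result[name] = [row[0] for row in rows]
    return result


def titles_single_join(db):
    """The same data fetched with a single JOIN."""
    result = {}
    rows = db.execute(
        "SELECT a.name, p.title FROM authors a "
        "JOIN posts p ON p.author_id = a.id ORDER BY a.id, p.id"
    ).fetchall()
    for name, title in rows:
        result.setdefault(name, []).append(title)
    return result
```

With five authors, the first function sends six statements (one for the list plus one per author) while the second sends one, and both return the same data. Switching runtimes or ORMs does not change that ratio; changing the access pattern does.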

09:58

And you know, I think we are living in an interesting situation now. When I kind of grew up, which sounds strange, when I got educated on software engineering, I started in high school. I was lucky enough to have an education system here where one of our local high

10:16

schools was specialized in software engineering. And we learned assembler, we learned C, we learned the core, and then we learned how to move bits around and, you know, how to manage memory yourself. And these are skills now that, I'm not sure if they're still taught, but I think if somebody comes in new and wants to become a software engineer, especially because of such a high

10:46

demand, I will probably not start with assembler. I probably won't start with managing my memory myself. But I will start with a higher abstraction because I want to be productive. I want to contribute. And with that, it means that, you know, I may miss things and do things that later on will lead to a problem because I don't know the underlying stack

11:11

and what's really happening. I've seen the problem go the other way though, right, like starting to play directly with threads or whatever the underlying is because you think you can do it better. And maybe today it is true, especially in some larger organizations, But then the industry or open source or whatever the language you're using framework of choice evolves over time and comes up with a much better agreed upon strategy than whatever you're using. You must have seen that

11:37

as well. Yeah, I agree with you. And I mean, threading is another perfect example, right? I mean, that's the source of many horror stories. And by the way, I don't want to give the impression that I really know how to write highly efficient code. I've just seen many examples from other people that have built it, and I could help them identify those issues. So if you listen to this, don't hire

12:05

me to build that system. Just so, you know, everyone's on the same page: when you say performance or high performance, are there sort of numbers that go along with that, that give us a good viewpoint into what that actually looks like? You know, this can be anything from systems that can handle, I don't know, thousands or millions of requests per

12:28

minute, per hour, a high transaction volume. But this can also mean systems that have lower transaction volume but that need to process terabytes of data. You know, batch processes, if you think about it, and batch processes have to finish in a certain time because otherwise you may run out of a window of opportunity. Or there are even regulations, like especially in the finance and the energy sector. Right now, I've just worked with a client

12:54

there. They have to finish processing all of their transactions in the batch by a certain time because otherwise they're not allowed to trade the next day. And so this may mean they don't have, let's say, millions of transactions, but they have transactions that are bigger in data that needs to be processed and validated, and, you know, external systems need to be contacted and updated. So highly performant systems really means whether you can deliver

13:26

what your end users are expecting from your system. And end users can be real users that try to book an online ticket for a show that is in high demand, or it can be, you know, other systems that call an API and expect, when they call your API, that it will deliver the result in a certain amount of time. And that time could be seconds, but it could also be, you know, maybe an hour. But there is a certain expectation

13:50

of a system. You triggered my curiosity there with the regulation. Like, there must be different observability concerns that matter in something like that, where it's not necessarily about the reliability of individual pieces or requests that come in, but about the whole data set. Like, that's not a world that's really familiar to me. So any insight you've got there would be super interesting.
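
The batch-window expectation described above can be turned into a simple check. This is only a sketch under stated assumptions: `BatchRun`, `projected_finish`, and `will_meet_deadline` are hypothetical names, and the projection naively assumes the batch keeps processing records at its average rate so far.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class BatchRun:
    """Progress snapshot of a batch job (hypothetical structure)."""
    started_at: datetime
    records_total: int
    records_done: int


def projected_finish(run: BatchRun, now: datetime) -> Optional[datetime]:
    """Estimate the finish time, assuming the average rate so far continues."""
    elapsed = (now - run.started_at).total_seconds()
    if run.records_done <= 0 or elapsed <= 0:
        return None  # no progress yet, so nothing to extrapolate from
    remaining = run.records_total - run.records_done
    remaining_seconds = remaining * elapsed / run.records_done
    return now + timedelta(seconds=remaining_seconds)


def will_meet_deadline(run: BatchRun, now: datetime, deadline: datetime) -> bool:
    """True if the projected finish time falls on or before the deadline."""
    eta = projected_finish(run, now)
    return eta is not None and eta <= deadline
```

A check like this could run periodically during the batch window and raise an alert as soon as the projection slips past the cutoff, rather than after the deadline has already been missed.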

14:13

Yeah. I mean, the one thing that comes to mind here is really, we talk about data integrity, obviously, right? I mean, from an observability perspective, it's not just that the, as you said, individual pieces of code that get executed to do certain things work. You know, I mean, obviously they should work reliably, but in the end, the

14:33

end-to-end business process needs to produce a result. So, for instance, a very simple example: if I would wire you money, then I want to make sure that if I wire you money and it goes through, it ends up in your bank account. If something breaks, if for whatever reason, you know, something happens in the middle and something crashes, I want to make sure that the money doesn't just end up somewhere in a random account. So I think this is also

15:03

very important, that we think about end-to-end data pipeline observability. This is another big topic, business process monitoring. There are different terms, I think, for these things, but in the end, you want to make sure that whatever you do, you end up with the results that you expect, and that proper resiliency and proper error handling is taken care of, and at the end the data on both

15:26

sides of the system is valid and accurate. Makes sense. So you mentioned that some of your listeners are architects, and it sounds like some of the performance conversation has shifted from reactionary to further up in the design process. How do you define, or what are your rules of thumb for, designing a performant application without getting into premature optimization? It's a good one, and Warren is smiling. So I think what I learned, and remember what

16:18

I said earlier: when I went to high school, I was educated as a software engineer, and we were never educated, unfortunately, back then, to think about how to measure and observe if my software is actually doing what it's supposed to do, because it was in the mid-nineties, and sure, we wrote log files, but we never thought about metrics and traces like we do nowadays, and things like

16:41

that. So what I can advertise now, or advocate for, and what I see, is observability as a requirement, observability-driven development, which means if I'm building a new service, if I'm building a new app, if I'm building a new function, whatever it is, I should not only care about what this function does, but also how I can validate and measure whether the function,

17:07

you know, executes in the expected time with the expected resiliency. So that means, what happens if my dependencies fail, how do I react? So it's really about thinking of observability as a non-functional requirement, because first of all, it helps me as an engineer to later on validate that my stuff actually works as expected. It helps me, if something doesn't work as expected, to troubleshoot, if I have the

17:37

logs, the traces, the metrics that I need. But it also helps me to justify my work, because if somebody comes to me and says, Andy, we need a new social media platform, right, we already have too many of them, but let's say a social media platform, then what's the measure of success? How do we, you know, differentiate ourselves from the others that are out there? So we want to have some metrics, and one could be better user experience, faster posting, an easier API, so that others can integrate

18:07

better with my social media platform. And then I need to say, if I'm developing it, I also want to make sure we can measure that the system is really used. And the hypothesis that we have that our system will be adopted better because we are faster and more resilient, it's actually, you

18:25

know, the hypothesis works. And so what I see now more and more is this, you know, pushing observability into the minds of architects, of developers early on, of people that build frameworks, of people that build new libraries. That observability is baked in and not an afterthought. I think that's

18:45

very important. Yeah, for sure. I've spent many decades at this point working in early-stage startups, and I think one of the most critical mistakes in a lot of the startups I've been involved with is failing to define what I call the success criteria. So you build this product and you launch it, but then you don't have this definition of success to tell you that your product is being consumed, utilized, and implemented the

19:21

way that you thought it was going to. And so now you're off building your company in this direction, and your potential customers are off going in this direction and you never identify that gap, which ultimately is bad for your startup. Yeah, exactly, because and especially at the startup, you have obviously limited resources, and you need to make sure that you are putting your resources

19:41

where you have the most impact. But if you cannot measure the impact, and if you don't know where what's your north star, like where you need to go, then yeah, and you're just running potentially in circles. Yeah. So you run out of cash, yeah exactly, and then if you're lucky enough, you sell whatever you have left and start something new and a

20:02

serial entrepreneur. It's also great. I mean, you run into the problem sometimes there, though, that you're successful in spite of your mistakes, and then you get to tell a great story about all the things you did right, and then people go and copy you. True. Yeah, yeah, yeah. Cool. So in your work over at the CNCF, what does a, what

20:26

does a typical day as an ambassador look like there? And how does that relate to your podcast? Because I think there's probably a correlation there of your public presence aligned with an open project like the CNCF. Mm hmm. Yeah. So first of all, I'm really

20:49

grateful that the CNCF gave me the opportunity to be an ambassador. And just for those people that don't know how this works, you have to apply to become an ambassador, and ambassador really means to work with the community, to help the community to grow, to help new members that come in to find their way around, to help them contribute and become just part of the community, right? And you also have to reapply for the ambassadorship, because it's always

21:14

a year. So we just had to reapply, and now we need to keep fingers crossed that we did everything right. Now, how does this relate to my podcast? Since I've been an ambassador, I've obviously been better connected with project leaders in the CNCF. And for instance, we just had an episode of Pure Performance around the open source project Kepler, a CNCF project. We also just recorded an episode on Argo,

21:45

which is a very prominent player in the CNCF. And so me being an ambassador gives me the chance to more easily connect with these different project teams, with the project leads, with the maintainers, and then I offer them a platform to promote their work, and with this also hopefully help grow the CNCF ecosystem and the CNCF community and educate them. It's a win-win situation, because in the end, you know, if you look at the CNCF, it's

22:18

a big, big landscape. If you look at the CNCF landscape, our industry is trying to figure out how we build platforms on top of Kubernetes as the base platform, and then picking the right tools. And so my goal is to educate on how to build platforms that in the end make people successful. So, platforms that allow developers to push out code faster. The big topic for us is always observability. How can we make sure that everything we build on

22:45

top of Kubernetes is observable by default? What does this mean for Argo, for Flux, for OpenFeature, for Kepler, some of these tools, you know, that I had represented on my podcast? And so we educate people on how to choose these tools, but also how to get observability by default when they decide to choose these tools in the landscape. Right, because to apply to be a CNCF project, there are criteria

23:21

you have to meet, right? And there's like an incubation process, and you have these certain goals that you have to meet in order to get to that graduated status. Is that correct? Exactly. You start with the sandbox, so that's kind of like the first phase of it, and then you need to prove that you have real adopters, meaning people that are really using your project in a production environment. You need to have people that are maintaining

23:44

the project, that are contributing to the project. So you need to kind of prove that this is a viable project that doesn't only depend on a single person, because that could obviously be the case if somebody is very enthusiastic and builds a great framework or a library or project, but it's one person. What happens if this person is hit by the legendary bus,

24:04

right? So that's why you want to build an ecosystem and a community, and you need to prove it. And once you prove this, then you get into the incubation status, and then into the graduated status. Exactly. So that's how the CNCF defines the different maturity levels. The details of the criteria, I don't know them. I cannot recite them by heart. I'm pretty sure there's a website we can link to. Get it tattooed on your forearm.

24:36

Oh yeah, I think there's actually a sci-fi movie waiting to happen there, you know, because we talk about the bus factor. I bet we could make like a Stephen King style sci-fi book or movie about this haunted bus that just travels around the world taking out solo founders. Huh, interesting. Yeah, side quest, it was just a tangent. There's the XKCD with, you know, a small little self-contained component that everyone's depending on that's holding

25:15

up a mess of architecture. So I think you would see, like, society, civilization crumble very quickly in that regard. Post-apocalyptic film, for sure. So how many years have you been a CNCF ambassador? Uh, well, I am now applying for my second year, so I'm still in my first year. Right on. And I'm still a baby, I guess, in that respect, but yeah,

25:48

no, it's been good. And as I say, we just reapplied, and so we'll see if they will leave me in the game. Also, with the CNCF, the ambassador program, we have regular check-ins where we basically show what we've done in order to contribute back to the CNCF, in order to qualify to be a good ambassador and not just use the title for nothing. You know, besides doing the podcast, I do speak at different meetups and conferences,

26:21

especially KCDs, Kubernetes Community Days. The topic that I'm currently kind of promoting and advocating for is the whole topic of platform engineering. So I was really lucky, and I think the ambassador title also helped a little bit, to be picked as a speaker at some of these local events that we have here

26:40

in Europe. And yeah, so hopefully you will have me back at some point, and then you can ask me the question and say, how long have you been an ambassador? And I can say seventeen years now, and then I will ask you, why did it take seventeen years for you to call me back on that podcast? Well, currently I can use the excuse that my email's down, but if it's down for seventeen

27:07

years, I think we'll go fully off. Is it really a problem, though, if my email doesn't get fixed, as long as Slack and everything else works? How did you get from, I don't know if your podcast is related to this, but, you know, you started the podcast almost eight years ago and now you're a CNCF ambassador. Like, what was the path here? The path

27:33

here? It's a good question. So when we started the podcast, the funny story is that we had a very close friend, a close friend of mine and Brian's, Mark Tomlinson. He actually encouraged us back then to start the podcast. He had his own podcast, which was called PerfBytes. He also talked about performance, and then he said, hey, why don't you just start a podcast? And then Brian asked me. He said, hey, do you want to do it? And what does it take, and how

28:00

do we, who do we need to ask for permission? And I said, you know what, let's just do it and ask for forgiveness later in case somebody is not happy with this. And that actually worked out pretty well. So we just started with Pure Performance. And Brian, again, thank you, Brian, for all that he's doing, because I just need to talk and he's doing all the post-production and everything. So I mean, it's not just talking. I also bring the

28:27

guests because of my network with our community, the Dynatrace and the CNCF community, so I have good connections to always find guests. But what we tried to do, and what we kept since we've been doing it, is we do a podcast episode every second week. This is also what you need for consistency. I'm sure you guys notice it as well. You need

28:48

to have consistency. This will help you in recognition, and this will help you, I guess, also with all the different algorithms that then eventually bubble up your podcast amongst the millions of artists that are out there. And, uh, yeah,

29:04

and we've been... I've always found the most interesting thing for me was that I always try to pick guests where I definitely can learn something, right? Because I want to pick guests that are experts in a field where I don't have expertise, because this allows me to learn something. So at least the hour that I spend recording the podcast is an hour well spent because I learned something. Whether we have listeners or

29:33

not. I know we have listeners, but that's always good. And yeah, pre-COVID, I was traveling a lot to different conferences, meeting different people, and so when I thought, hey, this is an interesting speaker at this conference, talking about an interesting topic that is adjacent to the performance kind of main thing that we have, then I asked

29:55

them, do you want to become a guest on my podcast? And that always helped. And so yes, eight years later, we had, as I just said earlier, the two hundred and first episode, and we were lucky in the early days to have people like Goranka. She was the head of performance and site reliability engineering at Facebook. We had Kelsey Hightower. We had Gene Kim talking about DevOps, right? We had many amazing guests that I would have probably not had on the podcast if we hadn't been persistent with,

30:30

you know, kind of building up a reputation. Some luck. Obviously, there's always some luck that you happen to be at a conference and you happen to meet this person, and then all of them are really helpful and they are willing to get on the podcast, and so yeah, so this is how It's how the path kind of went, and here we are and now I'm with Will and Warren. I mean, what else can I ask for? I don't know. I kind of think that's the top that I know,

30:59

Like, where do you go from here? And I think I just follow your lead and I shut down my email and retire, right? Go off the grid, buy a typewriter and a cabin in Montana. Yeah. You seemed super motivated to talk about platform engineering, and I feel like I've heard this term throughout my entire career, and to this day, I still couldn't tell you what that is. So, you know, funny enough, I'll give you my explanation in a second, but funny

31:33

enough, I was on another podcast. It was similar to this, but it was like a discussion where we were two and two on each side, and one side was saying DevOps is dead, and the other side was saying, no, DevOps is not dead. So kind of like, DevOps is dead, platform engineering is the new thing, and then the others, this is where I was, DevOps is not dead, because I have, you know, been talking about DevOps for many, many years, and

32:01

so they put me on that side. But coming back to platform engineering, I think platform engineering is trying to solve one big problem, which is that currently, when you try to build software, test it, deploy it, operate it, you need to know a lot of things,

32:24

right, a lot of different technology. The technology we work on is very complex, and especially if you then try to scale this if you have if you're not like a small startup like the three of us, but if we are one hundred five hundred, one thousand engineers in an organization and you ask from every engineering team to do everything end to end, then you probably end up with a lot of duplicated work, with a lot of wasted time because

32:49

everybody tries to figure out how to build, how to deploy, how to patch, how to do security, how to do networking. And so platform engineering, from my perspective, tries to provide what's called golden paths. So as an organization, right, if you are, I don't know, if you're Java.

33:09

If you are an organization that is heavy, let's say, in building Java-based applications because that's your skill set, then you provide golden paths where developers can go to a portal and say, I need to build a new Java-based

33:22

microservice or a new Java-based app. And then the platform engineering team provides a self-service portal where, as a developer, I can go in and I don't need to care about how it gets built, how it gets deployed, and I don't need to care about security, because all of this is baked in by the platform engineering

33:40

team. So the platform engineering team is essentially building an internal product that solves problems of internal customers, which are developers, so that they can focus on writing code instead of having to figure out how to package the code, how to push it out into the different environments, how to do automated performance and load testing, how to do security checks. So platform engineering, in a simple sense, is fulfilling the promise of DevOps at enterprise scale by providing

34:10

it as a self-service. I think it was a longer answer than I anticipated. I would like to add to that a little bit too, because the

34:21

way that I treat platform engineering is it's about putting guardrails in place. You know, because every company has financial, operational, security, and legal constraints about how they operate their business, and so I treat platform engineering as a way of putting guardrails in place so that someone from the engineering team can get from their application idea to production while meeting all of those constraints, even when

34:54

they don't know what those constraints are. It's sort of like, you know, when you go to the bowling alley and you put the bumpers in the gutters along the side so that no matter what you do, your ball is going to make it down and at least hit the pins. I kind of treat platform engineering the same way: we're putting in the bumpers to keep your bowling ball from falling off into the gutter. I like that. Yeah, yeah, it's a great analogy. I may borrow it at some point.
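
The guardrails idea above can be sketched as a check that a platform runs on every new service request, so the developer does not need to know the constraints up front. Everything here is an illustrative assumption rather than anything described in the episode: the `ServiceRequest` fields, the approved regions, and the budget limit are made-up stand-ins for an organization's legal, financial, and security constraints.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical organizational constraints, owned by the platform team.
ALLOWED_REGIONS = {"eu-central", "eu-west"}   # stand-in for a legal constraint
MAX_MONTHLY_BUDGET_USD = 500                  # stand-in for a financial constraint


@dataclass
class ServiceRequest:
    """What a developer asks the platform for; the defaults are already compliant."""
    name: str
    region: str = "eu-central"
    monthly_budget_usd: int = 100
    security_scan: bool = True


def check_guardrails(request: ServiceRequest) -> List[str]:
    """Return human-readable violations; an empty list means the request may proceed."""
    violations = []
    if request.region not in ALLOWED_REGIONS:
        violations.append(f"region '{request.region}' is not approved")
    if request.monthly_budget_usd > MAX_MONTHLY_BUDGET_USD:
        violations.append(
            f"budget {request.monthly_budget_usd} exceeds limit {MAX_MONTHLY_BUDGET_USD}"
        )
    if not request.security_scan:
        violations.append("security scan must stay enabled")
    return violations
```

A developer who only supplies a name gets compliant defaults, which is the bumper in the bowling analogy; the checks only speak up when someone steers toward the gutter.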

35:22

I'm pretty sure I borrowed it from someone else. I just don't remember who. And I think maybe one other thing about platform engineering is that platform engineering doesn't solve one hundred percent of the problems that you have in your organization.

35:36

I think as platform engineering teams, you should, like with any product that you build, right, figure out what the pain points are of eighty percent of your engineers, where eighty percent of your engineers waste their time on things that they shouldn't waste their time on, and then come

35:52

up with an easier way for them to do it. And this could be something as simple as, you know, creating a wiki page to explain how to create a new project with at least the basic security scans, but it can be something as sophisticated as building an IDP, an internal developer platform, using tools like Backstage, Port.io, or others, where a developer just clicks on a button and then fills out a couple of fields, and then out comes a fully configured Git repository with

36:21

pipelines, with security scans, with observability built in. And so I think the point is, you know, like when building any type of product, you first need to understand who your end user is and why your end user would use your product. What pain do they have that you can solve? And if you've figured that out in an organization, then you're also set up for success, because the other way around would be somebody hears that platform engineering

36:47

is the cool thing, so let's build a platform. And then you build a platform, and then the platform doesn't help anybody in the organization because it doesn't solve problems that people have right now. No, I

37:00

think that's a really important point that often gets missed. Like, if we're talking about platform teams as a Team Topologies concept, then what's often missing, I can't believe, still today, is what you're essentially describing: product management for that platform team, so that they know they're working on the right things.

37:19

And this is also what I try to stress, coming back to observability, because that's what I do on a day-to-day basis. With any product, whether it's, let's say, your e-commerce website that you're building, you want to monitor how many people are using it and how much money

37:35

you make. The same is true for your platform. If you're building a platform, you want to know: do you have one development team that uses it because they'd feel sad for you if they didn't use it, because they're your friends, or do you have eighty percent of the development teams using it because you actually solve a problem, and are they becoming more efficient in their work? So that's why you also need to observe the success and the adoption of the platform, and therefore you need

38:00

to solve a problem. And very importantly, your platform itself also needs to be available at the time when your developers need it. It needs to be resilient, and it needs to be performant. Because if at eight o'clock on Monday morning all of your developers come back from the weekend and they all have a great idea, they all start writing new stuff and they use your platform and say, I have a new idea, and all of a sudden your platform

38:24

crashes, then you've also missed the point. So you need to treat it as a product, and therefore you need to apply all the principles of software engineering to that product. Yeah, for sure. One of the big misconceptions there, which both of you have addressed, is that your engineering teams are your customer, and what gets missed is that they don't have to choose your platform as a

38:52

customer. They are free to go elsewhere. And so unless you have that frequent communication with them and really understand what their needs are, they will go somewhere else. Exactly. And you know, I've seen the problem actually go even further than that, where if you just count the number of teams that are using your thing, you don't know if they're actually getting the value out, right? If your platform is going to fall over one day, you know,

39:15

that could certainly be a problem. I've seen organizations mandate the use of internal tools that have been built, which can slow the teams down and whatnot. I mean, I assume the list of problems here goes on and on. I'm sure Andy has quite a few horror stories. Yeah. And I mean, I think this is what I said earlier. I mean, certain things in certain industries have to be mandated because there's no

39:43

way around it. You cannot just allow, I don't know, somebody in finance, in highly regulated areas, or in government to just pick any cloud vendor in any geo. That's just not possible. So certain rules of engagement, certain guardrails, have to be there. But if you come back to understanding the pain points of your internal customers and then making their life easier, it will be a success. If you think you know what your internal customers need without

40:15

even talking to them, and you build something, the chances of success are lower. So it's as simple as that. Yeah, and that's a hard conversation to have, right, because you're going to go up to someone, and it's the same for all areas of your life. You know, you have to walk up to someone and say, hey, what am I doing that's actually not helping you? Or what can I do to help you better? And so you're going

40:39

to get that feedback that's just going to, like, chip away. You know, you walk into the conversation thinking, damn, I nailed this, and then you have this conversation and they start chipping away at it, and it's feedback that some people don't want to hear, but it's absolutely critical if you're going

40:58

to have a successful platform initiative. Well, people really like negative feedback, right? So I just had the pleasure, we just had our annual user conference from Dynatrace two weeks ago, and I had the pleasure to have Marcio Lena from Dell on stage, and he was talking about how they brought observability as a self-service into their internal platform. And I asked him, what are the measures, the KPIs, that you are reporting back to your leadership that

41:37

actually show that your team, your platform team that is bringing observability in, is actually successful and making an impact? And I think I need to look at the actual numbers, but the recording is available. But one of the things that they measured is, they were surveying their engineers and asking, how much time do you currently spend coding versus doing things that are considered toil? And they went from something like thirty six percent productive to eighty

42:07

percent productive. Wow. Right, because they were given a platform that really solves problems, that was, you know, taking away work from them that could be automated, by making it easier to get their logs, metrics, and traces without them having to figure out how to write the logs or how to create metrics. Marcio and his team, they just invested a lot of time. They were investing in faster pipelines,

42:37

in more resilient pipelines. So he showed some numbers on how many millions of pipeline runs they have, you know, increased by in a matter of a year or two, which means faster turnaround, faster software engineering, faster software development and experimentation. So it's really fantastic. And so this is the reason why I asked him, what metrics did you use to show success? I was really blown away by, you know, the toil, how much that

43:06

went down, and the number of pipelines they ran, which means how many more releases they pushed out. Also, I think he had one interesting metric, since we are coming from an observability perspective or background: the mean time to instrumentation, how long does it take you to actually get the instrumentation? And, uh, it was fantastic. Right, good twist on the original DORA metric. Yeah, yeah, yeah, we get the metrics right after our first

43:38

production outage. Do you measure by the number of negative tweets, or what do you do? I mean, you know, I've heard a number of companies, though, talk about gray failures, which are ones where it's not so obvious that there's a problem, and that require external feedback that a set of serial requests or something like that reached too high a latency, for instance. And so having that stream seems super valuable, to be able to know immediately by

44:14

watching the API from some third party social platform as an observability tool. Yeah. No, I mean, yeah, I was half joking earlier. What I was saying is, I hope that you have more observability

44:30

than just monitoring the feedback on social media. But yeah, I mean, I know we have users, customers of ours, and they are pulling in data from the Twitter API and, interestingly, other social media APIs to get, kind of, a sentiment of what's going on out there. Wow, that's going to be Elon's new monetized product: using X as an observability platform. That would require people to still be using

45:01

it. It's true. And I'll stop commenting here, because we all have careers on the line here. So, next topic, please. Yeah, and awkward silence. No, uh, you know, it's really interesting. I was actually listening to some of your more recent podcasts to see how it's changed over time, and one that really intrigued me was the one on AI, and I figured it'd be a topic, and, uh, you know, whatever interest you have or insight you've got

45:39

there is certainly pertinent to whatever is happening in DevOps today. Yeah, I mean, AI is a topic. I think we all need to figure out what that means going forward, right? Because, at least from my side, I'm excited and, if not scared, I think we need to be very careful about what we do with it. And I think that's something we as an industry, even as, uh, you know, as humans, we need to figure this out as

46:10

a society. Now, where I believe AI is obviously a great help: as you can tell from my accent and from my not perfect English, it helps me a lot when I write things so that I don't make grammar mistakes, because, you know, that's what it is great for. It also helps a lot of people to write code, maybe not coming up with new cool algorithms, but at least writing boilerplate code that otherwise is just a time consumer.

46:39

Right. And now the question is, right, if we end up letting AI generate a lot of code, and then nobody understands that code anymore, how can we tell if that code is doing something that we didn't expect, or even, in the future, something malicious?

47:02

Right? Because we had one of our keynote speakers at our conference, she was heavily involved in AI research and the security aspect, and she was saying, if hackers infiltrate the training material of AIs that generate code, then who knows, maybe the AI generates some backdoor into your code that can then be taken over by somebody. So this brings me back

47:31

actually to where I started my conversation in the beginning. If we are trusting the stuff that is given to us without understanding what's happening, then we may end up with the next big performance problem, resiliency problem, or security problem. I mean, for sure, we know that it's possible to poison ML models so that the result is problematic. It's obviously a short jump from there to get to the next step, where you could do it in a way which causes a

48:04

real problem or opens a back door. Like you said, it's a really good point. Do you see it happening in, like, the observability space? Like, is there just anomaly detection, or are there things on top of that that you see coming down the pipeline that really are a real step forward that you're looking

48:21

forward to? Yeah, I think so. I think this is not only true for observability, but for many areas and products where you have a lot of people that need to use these tools but not on a day-to-day basis, which means they're not experts. So, right, if I look at our product and our competitors' products, they're extremely powerful products, but most people only interact with, let's say, the data that comes out in case there's a problem. So maybe this happens once a week, maybe this happens once a month.

48:52

If you only interact with it once per month, it's going to be very challenging. You're not going to be efficient. So one way AI can help you, and that's also what we've built into the product, and I believe our competition is doing that as well, or other players in the space, is using generative AI to say, you know, explain that data to me, or create a dashboard for this particular incident so that I understand what's going on for the area of the code that I'm responsible for.

49:22

So using generative AI to really become more efficient with a complex tool with complex data when you don't work with it on a day-to-day basis, when you're

49:36

basically not an expert. And I think that's where it's a huge help. And that's the same with code generation for any type of business code, right? Because maybe you don't write certain basic algorithms every day, and if I need a special sort algorithm, I can either look it up on Stack Overflow and then copy it in, or just say, hey, Copilot, I need this. Help, please sort this array in the

50:00

most efficient way. And that's a cool thing, and I see that this is there. Another thing, in my opinion, one area in the observability space, is standardization. So we have OpenTelemetry as a standard to collect data, and the industry is kind of trying to figure out a way: how can we not only standardize the way we collect data, but how can we also standardize the way we analyze the data that is collected? So, like, shall we, you know, come up with a

50:31

standard query language? Shall we come up with something? And I believe that these generative AIs actually have the power to, maybe not solve it, but mitigate this problem, because I can say today, I'm using tool X, so please, Copilot, give me that data in tool X, and tomorrow I move to tool Y, and I can still say in natural language, give me that data. And I don't know what language they're using to query this data, but just give it to me, because you are the expert,

51:00

right? You're the expert. So I think it could also solve this problem. It's interesting you brought up data. I feel like I've been hearing forever, like, we have to save all of our data. It used to be for testing, and so companies piled on, you know, magnetic tape backups of whatever worthless thing they had, and then it moved to user analytics and maybe business analytics, and I still feel like every company I worked for didn't know what

51:29

they were doing with it. And now it's, maybe we can write some ML on top of it, the best LLM model to, I don't know, generate stuff. Do you think that's changing? Like, is the majority of the data that companies are saving still just not worth anything, or is it changing, and they're finding the important aspect in whatever they're saving? That is a

51:52

very tough question to answer. I think we are still probably, on the one side, saving too much data, and it is also not well structured, which means the only way to make sense of it is actually to come up with algorithms and use, I guess, ML systems to kind of make sense of

52:12

this data. But on the other side, I believe we have the opportunity now to educate people, first of all, to use standards, to come up with good practices on structured logging, on when they use traces, what should be on a trace, when you're creating metrics, you know, what metrics make sense and what dimensions they should have, and also come up with some best

52:37

practices. On the other side, what we do, at least from the company I work for, right, we try to connect the data, because what we've seen historically is that many organizations in the observability space started somewhere, right?

52:52

We at Dynatrace, we started with tracing, so we've been doing tracing for the last twenty years, and then we added, you know, metrics and logs and real user monitoring. Others, you know, I'm sure, like Splunk, for instance, right, they started from logs and then they expanded into other

53:08

areas. And I think with many vendors out there right now, and especially if you, let's say, do it yourself and you create your own observability platform, you have data silos where you have your logs, your metrics, your traces, your events, and so on. And this actually makes it

53:24

very hard to then analyze the data and put it into context. And so one of the things that we at least try to solve is, as we ingest the data, we connect it, we enrich it with topology information, so we know that this log comes from this trace, from this service, from this Kubernetes cluster, from this end user. And with that we are also more efficient when we run our AI over the data, when we do analytics. And as

53:52

I said, I can only speak to what we are doing. Maybe others are doing it as well. But I think the problem that we need to solve as an industry is: don't just collect data because you can, but think about, when you collect the data, what type of data is it, what's the meaning of it, how does it connect to other data? What are your expectations, what do we expect this data to be? Because you're talking

54:17

about data observability. If you have, let's say, a percentage number that should be between zero and one hundred, and all of a sudden five hundred comes in, there's something wrong. And I'm sure this is the same with other things too. So you want to make sure that the data is clean, that the data is accurate, and that you only collect data that actually really makes sense in the end, because then we can also make it easier to analyze the data in the

54:43

end. Having a plan before you start is a good idea. It's always a good idea, exactly. Yeah. And to me that highlights, like, one of the big advantages of using, you know, not only a performance tool like Dynatrace, but having organizations like the CNCF, because there's a lot of work in making all those decisions right, and you have to have a ton of expertise in that area to even know what the right questions are

55:15

to be asking, much less even get to an answer. And so that's where I think having those resources available to someone like myself who's just trying to write application code to run my business actually helps because then I can use your expertise to solve problems that I didn't even know I was going to have.

55:39

And I think that's a hard mental shift for some people, especially younger people, like, oh, I can just write this myself and then I don't have to use this product or I don't have to sign up for the subscription service. But there are some long term benefits to, you know, using the product created by a team of a thousand engineers who are solving problems that I'm not even aware of. I just want to add one comment. It's not

56:04

only the young people that think they know everything better. It's also some old people. That's a fair point. And I include myself sometimes, right? Sometimes you just think you have seen this before and you know it much better. But yeah, right, yeah, you know, the CNCF, and especially, I mean, I'm very happy about OpenTelemetry already, obviously, right, it's in the observability space, and how

56:27

everybody's collaborating here. All the big vendors are collaborating, bringing in their expertise, contributing the different types of agents for instrumenting application code automatically and then enriching OpenTelemetry. Or we launched, almost two years ago now, OpenFeature, to standardize feature flagging, and a lot of companies, you know,

56:47

are contributing to it. It's really great to see what the CNCF does, because it breaks down a lot of these barriers and walls that you normally have between competing organizations. For sure. Yeah, because that's a definite change. You know, all three of us have been in this industry long enough to know that that was not how this industry started. You know, like back in the late nineties, early two thousands, to think that Microsoft was

57:15

going to do something collaborative with another company was just completely unheard of. Yeah, yeah, and here we are. Yeah. We always make the comment, I think Brian, he always makes the comment, this is no longer the Microsoft that I knew. For sure. Yeah, it's amazing. Yeah, I mean, it's interesting you bring up the collaboration.

57:45

One of the things that comes up is that a lot of these tools are still incredibly expensive at a small scale, and I see that being a reason that especially individual contributors, if they're just trying something out or building,

57:58

want to, or are even motivated to, do it themselves. Actually, it's interesting, we have a similar problem at Authress that we're trying to solve just on the metrics side. It's just such a small amount of metrics that we want to collect from a volume standpoint, but with a high number of dimensions, and it's difficult to find something fit for purpose for what we're looking for without jumping into a giant product that has a lot of different bells and whistles that go in

58:25

a lot of different directions. And the cloud provider we're utilizing right now just

58:31

fails in that regard, which makes it a real struggle. Well, I mean, this is the reason why, you know, as Will said, there are organizations that have thousands of engineers and that are solving tough problems like high cardinality, lots of data, lots of metrics, lots of dimensions. But as you said, not one hundred percent of organizations have this problem right away, and therefore for a long time it's perfectly fine

58:58

to use solutions that are free and open source. But at some point, there's a reason why commercial companies exist and invest. Like, we have more than a thousand engineers worldwide. You know, we're not just sitting there doing nothing and collecting the checks. We actually solve some of these challenging problems that many of our customers have.

59:24

Yeah, I mean, for sure. Even with a container you run of an open source project inside your cluster on a machine, it is incredibly difficult to get high reliability out of it and get it configured in an optimal way for your environment. It's not a trivial challenge.

59:42

And maybe one comment to this, and I know we are probably already way over time, but I wanted to add one more thing, coming back to the AI question that you had earlier, because what you just said is that it's very hard to configure, and to know how to correctly configure, your Kubernetes nodes, your pods, your resource limits, your request limits. And again, this is also something where I believe AI can help us, or

01:00:10

whatever you want to call it, right? It is basically using the best practices and things we've learned from other systems and then applying them to the workloads we now see.

01:00:22

So we are working with a partner company in Italy, let's just give them a quick shout out because maybe somebody's interested: Akamas. And they are running experiments, algorithmic experiments, where they are finding the optimal settings for your JVMs, your CLRs, for your containers, for your Kubernetes clusters, and they look at observability data and then you give them a goal. You say, hey, I want to optimize memory and CPU, but I also don't

01:00:52

want to jeopardize my performance. So you can set them goals and then they find the right configuration based on some machine learning, some

01:01:00

models that they build, and they have some really interesting numbers. And I'm not an expert in Kubernetes down to every single configuration line, but this is where again I would go to AI to at least support me and say, this is what I would recommend, because I've analyzed thousands of systems and therefore I give you this recommendation when I see your system. And then I can decide whether

01:01:25

I want to go with this recommendation or not. Makes sense. Yeah. And on the other side of that, I find one of the common things I do with AI is, whenever I come across a configuration or some code that someone else has written, I'll just paste it over into AI and say, can you explain what this does? And obviously what this means in the end is that there will definitely be certain jobs that will be less in demand because they will

01:01:53

definitely be replaced by AI. But I think this also gives us a much better opportunity to think: instead of doing the same mundane thing all over, I hopefully now have more time to think about how I can contribute my brain power to something that an AI cannot solve. Right. Yeah, if your career passion is creating security groups inside an AWS VPC, AI might be a career

01:02:24

limiter for you. But if you would just prefer to move on to something a little more fun, I don't think AI is at risk of taking your job away. Yeah. I think continual learning is always a requirement as technology evolves. I've seen AI disrupt more companies, though, than individual jobs in a lot of areas. So, you know, giant companies like, say, Stack Overflow get disrupted. But as far as individuals and their abilities, like, that's still

01:02:52

an open question as far as what will happen. All right, should we move on and do some picks? Let's do it. Warren, I'm going to put you on the spot. Have you got a pick for us this week? Yeah. You know, I have to spend all week thinking about that. Part time job. Yeah, that's actually more challenging than doing the preparation for the episode, honestly. Uh, this one I think I figured out, though. In a couple of weeks, I'll actually be

01:03:22

in Dresden for the Decompiled conference in Germany. I'm actually giving a talk there on how at Authress we iteratively added security as we grew both our technology product and our organization, and I'm sharing that with whoever else is there. But there are a lot of interesting talks at the one day conference, so shout out to Decompiled, and I'm sure I'll bring it up again before I actually have to hop over for that. Right on. Andy, you were pretty

01:03:52

excited about doing picks. What have you got for us? I actually got two for you. Oh, absolutely. So first, the fun thing. I think it's very important in life. You know, we spend a lot of time in our job, and even though we love it and we're passionate, it's a lot of time. I think you still need to have something outside of your regular work that kind of, you know, gets your mind off work. Mine is salsa dancing. And I know there's a lot of

01:04:16

salsa dancers, or at least, you know, dancers, in our industry. So if you have not yet tried salsa, you should. And the good news is, in any city where I've traveled, and I've traveled quite a bit, there's typically always a salsa community. And salsa communities have been around for a little while, so most of them still organize themselves on Facebook. So if you are kind of post-Facebook generation, because you're much younger than we are,

01:04:42

then this might still be a good thing to go on. And then you just search, salsa in Boston, salsa in Los Angeles, and you've found it. So I just love it. When I travel, I go dancing, and it keeps my mind off work. It's also exercise, so it keeps me fit and healthy as well. Right, and then the second thing is just a little tip. I don't know if people

01:05:04

are still watching, but on the weekend I went ski hiking, so not downhill skiing, and, you know, it's not cross-country, but you actually go up on the high hills, on the peaks. Ski touring. Yeah. And unfortunately it's, like, too warm right now, so that means we had to get very high up to actually get some snow. And then it was so warm, and I thought I can just be cool and, you know, go with my short sleeves and

01:05:32

my shorts and nothing will ever happen. Well, my wife disapproves; she wouldn't like seeing me. I made a wrong step and I was sliding down, only a couple of meters, but on the ice and the snow, and I have a couple of bruises. And so, folks, even if it's a warm day and you think you have no problem on skis because you will never trip or slip, don't do it. Just wear something long-sleeved. It's just

01:06:04

better. It's painful otherwise. It's been painful for days. Right on. So I have two picks this week as well. If you've been listening to the last few episodes, you won't be surprised at my first pick, and that is PlatformCon, coming up here this spring. PlatformCon is great. It's a five day, free virtual event on platform engineering and there's

01:06:35

just tons and tons of great speakers lined up. So if you heard what we were talking about on platform engineering today and you're interested and want to figure out how that can actually make your life easier, make your team's life easier, make your company more successful, there'll at least be a few talks to check out at PlatformCon. And then at the end I'm doing a live Q and A with some of the conference speakers, so that's going to be fun. And then my second pick is Obi

01:07:10

Vincent. Obi is a personal trainer from the UK, but he has a great website and focuses on a lot of kettlebell type workouts, and I'm a huge fan of kettlebells. I've been working with kettlebells exclusively for about the last year, and he's got some workouts and some guides and some tips and stuff on his website, Obi Vincent dot com, on how to effectively use the kettlebell and get a great workout in with those.

01:07:43

So, if you've never done kettlebells or you're looking to try to get into working out, kettlebells are just super cool because it's a low barrier to entry and it's dynamic. And actually, I've noticed in the last year of using kettlebells, you know, I'm fifty three now, but I've actually gotten stronger by stopping using barbells for working out and using kettlebells, just because it tends to line up with how our bodies want to move anyway, which translates to greater strength

01:08:16

improvements. So, yeah, go check out Obi Vincent dot com. Do you have some in that room there that you want to show off? No, they're actually out in the shop, but I can bring some in for next week's episode, for sure. Yeah. Absolutely. Yeah, I've got about five hundred pounds of kettlebells in different sizes. Yeah, which is definitely not needed. Yeah, yeah, don't do that. Oh god, I've got to finance this now. You

01:08:53

can. You can just grab one kettlebell and start with that. So that's like the upper end. If you're hitting that number, you know, maybe you've gone overboard, right, or maybe, like, talk to your therapist about what you're really not focusing on. But that's a subject for a different podcast. Awesome. Andy, thank you so much, man. This has been a great conversation. I really enjoyed having you on the show. Thanks for having me. And yeah, as I said, you know, seventeen more years,

01:09:29

seventeen more years, so now we'll be back. I'll put it on the calendar. Yeah for sure, Warren, thanks for joining me. Love having you as my co host. It's been a blast and look forward to many many future episodes with you and to all of our listeners, thank you for listening, and be sure and check out the Pure Performance podcast as well if you aren't already. And we will see you all next week.

Transcript source: Provided by creator in RSS feed
