Navigating Work Patterns and Internal Tool Reliability in Engineering Teams - DevOps 197 | Adventures in DevOps podcast

00:14

Hey, y'all, welcome to another episode of Adventures in DevOps. I'm your host, Will Button. Joining me in the studio, my co-host, back on a streak making Tom Brady look like a slacker, Warren Parad. Welcome, Warren. Thanks for letting me come back for another one. I've got my hopes up that it will keep on going. I'm Warren, I'm the CTO of Authress, just to reintroduce myself. Yeah, I mean, I like how this is going so far, and I have no plans to coward my way out

00:50

of the future. Right on. I'm excited to hear that, because otherwise it's just me and the guest, and that always leads to trouble. Speaking of guests, joining us today is Pete Fritchman, SRE and staff infrastructure engineer over at Observe, Inc. Pete has joined us today to talk about observability in those pesky internal applications that we all have. And I don't know, it might be a love-hate relationship there, but Pete, welcome to the show. Hey, thank you

01:26

for having me. Right on. So tell us a little bit about your background, because you've got the staff engineer title, which takes a while to get to and is also, I think, a little bit uncommon on the infrastructure side. You know, it's pretty common on the software engineering side, but I think it's a little more rare to see staff engineers on the infrastructure side. So tell us how you got to that point. Sure, yeah,

01:57

I mean, I've been doing this for a long time. I was into computers as a kid, which I guess is more common these days, maybe less so in the nineties. But I knew early on as a kid that I wanted to do computer stuff, and I wasn't exactly sure what computer stuff was. Something with programming. I enjoyed that kind of, you know, make-the-computer-do-what-I-want aspect of things. I was really lucky to land an internship at my local ISP in, like, eighth grade, the summer after

02:27

eighth grade, before ninth grade. Oh no, I had a great mentor there. I had done a mentorship, like, my seventh grade summer doing Linux stuff, and I bought a laptop and ran Linux on it. And running Linux on a laptop is a great way to become one with Linux, you know, hate it, love it, whatever. And working at this ISP and having a really great mentor, George, I kind of realized, okay, sysadmin is the thing I'd like to do,

02:54

Like, I enjoy debugging these problems and building things and writing automation. So, you know, I did the high school thing. I worked all through high school. I've been a workaholic forever, for better or worse. I worked at this ISP all through high school. Went to college for a year. Wasn't my thing. It was fun socially, but the school part was just not. I don't know, I just lacked the focus. I just really wanted to work, right? So I had contributed a bunch to FreeBSD in

03:29

my high school days. I was a ports committer, and ports are like the packages in FreeBSD. And through that I landed a job in a really awesome group at FedEx doing system administration on everything Internet-facing. And it was a small group, and looking back, that was the group to be in at FedEx for doing Unix-y stuff. They were definitely ahead of their time.

03:53

Everything was automated. They wrote their own tools, and I thought this was very normal for, like, a 2002 shop, which, you know, now we know maybe wasn't. So the whole automation-first thing has always just kind of been "how else would you do it?" kind of thinking. And then I was lucky enough to land an SRE gig at Google after that in 2005, and, you know, they were like, hey, we should automate things, and I was like, well, yeah,

04:16

how else would you do it? Right? And then just from there it's been a whirlwind of startups. And I tried the banking world for a little bit, had some fun there, had some not-fun there. I ultimately decided to go back to the startup world because, for me, that's where the most fun is. The most fun I had at a bank was kind of a startup in a

04:39

bank, and that's very hard to find. Oh, for sure. So yeah. Cool. Yeah, I mean, it's almost two completely separate careers, working at large enterprise organizations versus startups. Like, you have to have two different mental models to be successful at each of those. It's almost two different skill sets. I mean, the base technical skill set, of course, and then, you know, at a startup you kind of have to pick

05:09

up the pieces and lead the way with what you have. And then at a big enterprise you have all the resources, but you also have all the politics, and so, as much as everyone I know hates that, you have to play. You have to make friends in other organizations and figure out how to influence people, and it is hard and stressful. I enjoy the stress of tech, and the stress of that is just, it's a lot. Yeah, it's a lot of political engineering

05:36

versus software engineering. Right, for sure. So when you started work at the ISP, are we talking, like, dial-up modem ISP? Yeah. I sat in the after room. I had, like, PM2Es next to me, and those are quiet at least, and then I had a rack of, like, 56Ks that were just, you know, I'm pretty sure I could whistle my way into a 14.4 connection. Back then, we were, like, the regional ISP, so we did

06:09

T1s for businesses, and their big thing was they would put the, I forget what Cisco it was, but there was something where you could take the T1 and say, oh, I want to take some of my channels and use them as phone lines, and some of them as data. And that was, like, revolutionary in '97. Yeah, for sure. Yeah, I was working in the telco industry right around then. Nice.

06:28

Yeah, and then we got, you know, DSLAMs, and the whole DSL thing came through, and it was just a fun way to get a lot of exposure to a lot of different things. I got to go to POPs and put stuff in, I got to build Unix boxes, I got to kind of do the whole gamut of things. So you got the benefit of dealing with Y2K. Yeah, you know, like, yeah, it was. It was such a non-event that people at work weren't even

06:50

worried. You know, everyone was very much a realist there. They were like, whatever, everything breaks, we just live in a world where nothing works, and, you know, we patched our software. What more is there

07:00

to do? I knew people, I didn't know them very well or personally, but I knew people pre-Y2K, in the area I lived in at the time, who built underground bunkers and stocked them, and, like, late December they were going underground and saying, we're not coming out for ten years or something. And I've never seen any of those people again, so I'm, like, insanely curious. Were they out in January? Are they still in there? Yeah, yeah. I didn't have, like, the, you know,

07:33

I didn't know COBOL. I mean, there was a whole, you know, crew of people that were just making insane money on projects, just Y2K prep, needed or not. I mean, I feel like we'll hit that again in 2038 with 32-bit time_t. Like, that one is actually a little scary. I hope to be retired by then,

07:50

right? No doubt. That seems like too much. Cool. So, talking about internal platforms and observability. Yeah, whenever this came across as our topic, I was like, you know, that's just brilliant, because all of these little internal apps, and sometimes not-so-little internal apps, but, like, the things that the company uses to make decisions about whether we're doing the right thing or not, they often are just, like, little pet projects.

08:26

So what's your experience there, and how do you get those recognized as the valuable assets that they are? Yeah. Well, often it happens on its own, but it's the worst case, where there's a catastrophic failure at the worst possible time, and everyone goes, oh, that was really important.

08:46

I worked somewhere at a public company, and the system that did the closing of the books every day, you know, like, very boring, right? Well, it wasn't working, and it was taking, you know, thirty hours to close twenty-four hours' worth of books, and obviously that doesn't work so well. And they kind of didn't notice it or take any action until, like, a week before they had to do some SEC filing, quarterly thing.

09:09

I don't know all the details, but it became, you know, suddenly on Friday afternoon, it was, we need all hands on deck to fix this. But it's everything from that to not just necessarily apps, but, like, internal infrastructure, right? CI and developer experience, and everyone's got this, you know, shell script infrastructure that devs run to run the local cluster on their laptop, and, you know, at any company you can probably sit down and find a

09:35

million of them. I just think they're perpetually under... you know, the sexy part is production, right? People want to build SLOs for users and show awesome graphs and, look at this great incident management process we have. Then they have this, like, Jenkins instance that's barely holding on internally, you know, that everyone hates but no one really talks about, because it's just, oh, yeah, that's life. You know, that whole side of things just needs

10:01

more love. Yeah. And so I think part of that, you know, like you mentioned there, especially with, like, Jenkins and things, it's people who build tools to solve specific problems that they're having, and then it almost has, like, this organic growth, where other people see it and are like, oh yeah, I need to use that too, and then it grows in its role in the company. So how do you identify

10:33

those before that catastrophic event? Yeah, well, I do a lot of thinking about, like, how to compare it to production. So in production, if you say, I want to launch this new microservice, or I want to consume this new AWS service, you write the design docs, you have a launch review. I mean, hopefully, right? Ideally you have these things, you have this whole, like, rigmarole and process for

10:56

it. But internally it's like, oh, I'm going to fire up an EC2 and run this new tool I downloaded. Hey, two weeks later, it's important. You have to apply that same principle. And part of it can be a staffing thing, right? Like, you know, you might have this team of a million people working on prod, and then you've got the two, you know, IT guys working on infrastructure stuff that doesn't work very

11:15

well at all. Yeah. So, I mean, I think you have to put some process to it, and I'm a big fan of the whole incident management, postmortem process. Like, I think that is a great way to drive it. You know, outages happen. And we all have to accept that nothing's one hundred

11:33

percent, but you should get the most out of every outage. So when your internal tool does blow up and cause a, you know, company-visible, everyone-is-like, "hey, I don't really know what this was, but we couldn't do business for a day," make the most of that, right? Hey, this thing broke. By the way, there's five other services we've identified that are in the exact same state and could all blow up tomorrow, and we'd be on the same call. And so that's your, you

11:58

know, entrance point to getting everyone to have attention on that thing. And it's unfortunate that, you know, things have to fail first sometimes, but that's how it goes, especially at a startup where you kind of have conflicting priorities. Oh, for sure. Yeah. And I think it's a really good point there, like, outages happen, so let's just make the

12:18

most out of that learning experience. Absolutely. And that's been a cultural change for us as an industry over the last, I don't know, I'd say ten to fifteen years. I feel like Etsy published that, you know, the famous blameless postmortems. It feels like that's ages ago now, but it's still so relevant every day. Yeah. Yeah, things broke, who cares how they broke? Let's just have it not break in the same way again, right? And if you can do that, that to me is like

12:46

the health of an SRE team, right? Yeah, I really hate places that measure health by, oh, there were four outages last quarter. Okay, like, were they repeat root causes? Yes? Okay, maybe there's a real problem. But if they weren't repeat root causes, and you're solving root causes and writing good postmortems, outages are great. Yeah, yeah,

13:07

for sure. And I think that's one of the things that it's really hard to convince people of: outages aren't as bad as you think, because where most of us work, like, we can recover. And I say that coming from a background where sometimes when we had outages, people's lives were at stake. And so if you can walk out of an outage and say no

13:33

one died, like, we're going to be all right. Yeah. Well, plus it ties into SLIs, SLOs, error budgets, right? A lot of times, before the SLO concept became very popular, it was like, is five minutes of downtime bad? What does that mean? Now they have this error budget, and you can go, well, you know, we're a three-nines service, committed to that, we have forty-four and a half minutes a month. Okay, five minutes isn't

13:58

great, but it's fine, you know. Yeah, but if you have, you know, hey, I've gone past it and we're writing checks or, you know, giving SLA credits, like, okay, now, you know, we have this quantifiable, not-how-people-feel number to talk about things. And yeah. So for internal apps, do you take them that far, where you assign them SLIs and SLOs and SLAs for performance? I've been starting to. So this is actually my first gig out of many where

14:28

I've really kind of focused on internal stuff. I've often been the problem, you know, the person focusing on prod and, you know, on a different team. Now my team is kind of, you know, wearing all the hats here, and yeah, so we're putting SLOs on internal services and internal workflows and trying to treat them with the same... we have that in production, of course, and we're trying to give it the same, uh,

14:50

the same level of, you know, thought and execution. I've seen a huge shift in the industry overall, where some teams, maybe they're called platform teams, have sort of been excluded from thinking about what a product is or how they do product management or even product ownership. And I feel like that's really been turning around, first with the DevOps movement, and now just overall in how we build services, microservices, all together, and no one

15:18

is sort of excluded or has a different process for internal teams. Doesn't matter. You're still offering a real service to someone. It just happens that your customers are within the same organization. That's exactly how I sell it. Yeah, you still have customers, they just happen to be coworkers, right? Yeah, you can still do the same stuff. You can build user journeys. You can figure out what does a customer expect, what

15:41

makes a customer angry. It's a lot easier to figure it out, because you can ask them in Slack instead of guessing, like, man, what's the threshold at which my customer complains? You can just straight up ask them when they work for you. You know, I don't think they necessarily have good, like, nice answers, though. Like, your customers, you find there's a variety of personalities, right? People who, the moment a Jenkins

16:00

job doesn't pass in four minutes, they're like, ah, Jenkins is crap, it's terrible. But at least you can get it, you know. That's how customers are in general, right? So you get the whole gamut of emotions. But you can tell when there's an outage, you know, by how loud people are. Internal people will tend to be louder than your customers. I mean, that's what I mean. Like, it's still a problem in some regard, but, like, how

16:21

do you temper the difference of perspective? Like, I feel like users on the outside are more quiet in a lot of ways, like it's difficult to pull information out of them, and internally, as you mentioned, you know, it's like everyone's screaming as soon as something goes even a little bit wrong. I think that if you can show them that, A, there are graphs, right? Like, there should always be a graph. You know, like, generically. I worked

16:41

for a guy that always said, "Where's the graph? Where's the graph?" And it was annoying at first, but you're like, okay, maybe there should always be a graph, so people know that, hey, people are actually watching Jenkins.

16:51

Or rather, computers are actually watching Jenkins, not people, right? And if you show, like, hey, we had this outage, and yes, it was a really terrible day and no one shipped any code, but here's this great postmortem everyone can read, and it's got a reasonable trigger and root cause and follow-ups and timeline, and we're actually closing the follow-ups. Like, it's very visible, and I feel that should be the same way with customer outages. It's

17:12

very visible, right? There shouldn't be anything to hide. Yeah. So one idea I've been working on, because I work a lot with startups, and, like, the whole thing about a startup is, odds are the product that you launch is not going to be the product that you're successful with, and you're going to try a lot of different things before you become successful as a company. So how do we measure that so we don't spend any more time

17:41

than is absolutely necessary working on the wrong problem? And so I've been working on this idea of success criteria, like, what does it take to make this application successful? And a lot of my background is in the infrastructure for mobile apps, and for mobile apps, you can measure it as, you know, we need ten thousand monthly active users spending forty-five minutes per week in the

18:04

app, or something like that. Measure engagement. Yeah, yeah. So, because then if you tie in, like, you know, our total cost of acquisition to get a new user is X amount of dollars, and our cost of infrastructure is X amount of dollars per user, you know, you can do some pretty simple math there to find out how

18:25

many users you need for this to be a profitable product. And so trying to figure that out and get that in the early stages of an application so we know when to either double down on this application or shelve it as soon as possible. I'm wondering if like that same thing applies to internal tools, and how do you how do you define what that looks like? Yeah,

18:51

that's a good question. And I think, you know, where we deployed, like, a code analysis tool, yeah, I don't know how to quantify it for stuff like that. But again, we have some very active developers that will give feedback, like, hey, this thing gave me a bad analysis, which we can take and make it better. It's like, okay, that means they're

19:12

actually looking at it and, you know, making code health better. And I don't know, I think you're on the right track with, you know, having a number like MAUs and that kind of thing. It's tough internally. Yeah, yeah. Well, and it's interesting, because I think there's two classes of internal products, right? There's internal products that people are very opinionated on. Code analysis is one, right? Like, I think this gives me bad code analysis, I don't like that.

19:41

But CI, right? I think largely people don't actually care what CI product you're running. They care: do my PRs get approvals fast? Is master green? Do deployments happen at the pace that we think they should at our company? And if that works, who cares what the product is, right? And so for stuff like that, you can just be super results-oriented, right? Like, hey, here are the SLIs, these are the SLOs we're trying to go for, we're making them, we're not making

20:08

them. We're a Jenkins shop now, and we're considering alternatives. And of course, you know, you ask people what we should use, there's a million things, but no one actually cares what we run, right? You know, if they can declare their jobs in some declarative way that's not too terrible and it works, everyone's happy. Yeah, I'll tell you what'll make them care: if they have to be the ones to convert your Jenkins jobs to the new platform. Then a lot of people will be like, ah,

20:37

you know, Jenkins is probably okay. There's a lot of GitHub Actions talk, because a lot of people have done stuff there. So, I mean, part of that is the political, or, well, the human side of it: hey, if all of the products can do the thing, maybe we pick the one people have the most experience with, just to make life easier in the transition. That's a good point. I mean, Will, you asked a question, which is, how do we know

21:02

our thing is going to be successful at a startup level? And if we take the perceived public metric that ninety percent fail, what do companies do with teams that are part of that ninety percent? Do they let them go? Do they reposition them? I think that's a huge struggle, and it's scary to know that there are some metrics associated with your success even within a larger company

21:29

that you don't have job security in. Yeah, I can tell you, just in my experience with startups, when you identify the success/fail criteria answers that question, because if you can identify it early, then you take those people and you put them on what your next idea is. But in many cases, if you wait too late to identify that this product is not going

21:56

to be successful, you've already been bleeding too much cash for too long, so you've got to salvage what's left, and the first thing that goes when you're trying to cut costs is your staff. I wonder if there's a lesson to be learned that can be pulled to internal apps, though. In the startup market, there's this idea of letters of intent, right? Getting signatures from potential customers even before you've built anything based off of that idea. And maybe there's

22:22

an idea here of how to transition this to even internal teams. Yeah, I like the idea of formalizing it. I mean, I think what I often do for these kind of things is, like, hey, you might have ten teams, and six teams really care about this thing you're building because they're big users of it. You find, like, a champion on each team that is, you know,

22:44

opinionated and is willing to talk to you about what's good and bad. And maybe not a letter of intent, but you write a design doc and then you make sure that they've read it and give you feedback, not just to check, but you know, if anyone reads a design doc and has no comments, they didn't actually read it. Like I've never met an engineer without

23:00

an opinion on a design. So so you get you know, you get feedback from people on these teams, and it's maybe not as formal as a letter of intent, but you kind of have the you know, social buy in, right, Yeah. Yeah, And that was one of the things that we whenever I worked at Active, I started with them super early and

23:21

we just accidentally got this right. But we built this platform for all the engineers to build and deploy their software, and the way we built it, we had such great collaboration between the infrastructure team and the engineering team that anytime the engineering team wanted to expand the capabilities of that platform tool, most of the time their request came in the form of a pull request to just

23:49

add that feature to it. Yeah, and I've never been able to duplicate that since then, but that had such a significant impact on how I think about platforms that that's my goal every day. Yeah, I like that. It's kind of the open source approach, right? Yeah, I mean, I think a lot of that is the buy-in, right? Because if you're using a platform or a language that no one else wants or knows or likes, they're not going to learn Rust to send you a

24:18

pull request to fix a thing they don't like. But if it's already in Go and the rest of your code's in Go, yeah, they just show up one morning. Maybe there's a corollary here as well to the total economy in the startup world, where, you know, your customers are somewhere along the spectrum of innovators, early adopters, early majority, all the way to laggards, and it's really, who are you talking to? You know, what does the

24:38

rest of your organization look like? Because I feel like, well, you would need innovators and early adopters there who not only need that functionality but have a huge stake or care about how it's implemented, and not just people who think it's table stakes or just belongs there or have opinions that they just want it done their way. Yeah, yeah, true, yeah, And that's I think you hit that. I think you hit it right on the head

25:06

there, that early-stage startups require that innovation mindset. For sure. They're not going to get very far if only a few people are thinking about how to push whatever their product is forward. Yeah, I really agree. And I want to clarify that, you know, there's a certain type of person who just wants to show up for work and do their assigned job, and they want to do that for twenty or thirty years till they

25:33

retire. And there's absolutely nothing wrong with that, just that an early-stage startup is not the right environment for you to be successful. Yeah, you definitely see that. You can find that at the bigger companies. You have these classes of employees, right? There are some people that you can tell either came from startups or are going to go to startups afterwards. And then, like you said, there are some people that just, you know, close tickets and

25:56

do the work. And that's, yeah, completely fine. Yeah, I sometimes wish that I could be that person. It'd be a lot less stressful. And yeah, I don't know, maybe I will be one day. Who knows, right? People change. But yeah. So let's talk a little bit about the types of internal platforms, because we've talked about, you know, like CI/CD. What other tools are out there that we should be looking

26:26

for as internal tools. One that pops up immediately in my mind only because I've been bitten by this one at every single company I've ever worked at for the last three decades is the internal data analytics team. Like those guys have monster infrastructures and are just doing what it takes to generate the reports that the business wants to see. And whenever you get a hold of it, like, oh wow, I'm not even really certain how this thing is working.

27:00

So what other internal tools are out there? Well, some people run their own VCS, you know, GitLab on-prem. We run Gerrit, so that's its own, you know, care and feeding and SLOs. I think one thing people actually miss is the monitoring infrastructure itself. You know, if you run Prometheus and Elastic, or if you don't have a vendor, right? And even if you have a vendor, you have some component that is shipping metrics to them, right? You have to monitor

27:30

that. If that stuff all breaks, you're flying blind, and you need to know it. I think that often goes neglected, right? I worked somewhere where someone came to us once, and, you know, they ran a really noisy on-call rotation, which is its own problem, and they knew things were broken because it was, "I haven't gotten a PagerDuty page in ninety minutes," and that was their escalation. And I was like, well, it's actually broken, and this is so sad that this is how we found

27:55

out about the problem. On many levels, right? One, we didn't know.

27:59

Two, you expect to be paged at least once every ninety minutes. But truly, I think the monitoring infrastructure, it seems like it should be obvious, but it's not necessarily obvious. Do you think that's a miss from, like, existing observability tools, that they aren't able to have internal metrics on what the expectation is on getting logs? Like, I feel like we've used CloudWatch for a while, and one of the things it does in AWS

28:25

is you can alert on missing data. Well, that's hard. Like, in Prometheus, it's very hard to alert on missing data, right? You end up alerting on, like, the up time series being zero. But then what if you have a problem with whatever generates your targets and there is no up time series, right? Like, there are a lot of these different things. Yeah, I don't think it's first class, the, you know, watch the watchers. I think the data is there, though, and you have to kind of do

28:49

it. I worked somewhere where we had a relatively large Prometheus and Grafana infrastructure for metrics and a really large Splunk infrastructure for logs, and both of them had their problems, and they were run by two different teams. And we got together and said, hey, let's make a deal, right? We'll write a prober for Splunk and have it export metrics, and monitor Splunk in Prometheus. You write a prober for Prometheus and have it write logs, and monitor it in Splunk. And

29:15

that's not perfect, but you make do with what you've got. And that was a large place where it was going to be hard to bring in a third-party tool to help us, and, you know, we made do with it. And it was super successful. We found lots of problems. You know, if they both died at the same time, okay, yeah, that's the end of the world. But the world is probably already ending if they both died. So do you think there are third

29:37

party vendors that actually solve this in a reasonable way? Well, I mean, I think it's hard to say, right? It depends what you're looking for. If you're using a third-party vendor for monitoring, you know, you need to look at your metrics shipping. Is your data there with them? Right? So you have to set something up in their system to do it. But then do you want a backup? You

30:00

know, do you want to know if they're up? You may have to run a little side monitoring infrastructure too, to watch them, because there might not be anything you can do about it, but you may want to at least be aware that, hey, the thing that normally sends me alerts is not going to send me alerts. Maybe we should all go back to the NOC days and, you know, stare at some things for a couple of hours.
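The "watch the watchers" idea discussed here can be sketched as a dead man's switch: a small side process that does not wait for an alert to arrive, but instead checks how long it has been since each monitoring component last proved it was alive. This is a minimal Python sketch under stated assumptions. The `Heartbeat` class, the component names, and the thresholds are all hypothetical and are not taken from Prometheus, Splunk, PagerDuty, or any vendor's API.

```python
from dataclasses import dataclass


@dataclass
class Heartbeat:
    """Last time (epoch seconds) a monitoring component proved it was alive."""
    name: str            # hypothetical component name
    last_seen: float     # epoch seconds of the most recent heartbeat
    max_silence: float   # seconds of silence tolerated before we worry


def stale_heartbeats(heartbeats, now):
    """Return the names of components silent for longer than their threshold."""
    return [hb.name for hb in heartbeats if now - hb.last_seen > hb.max_silence]


if __name__ == "__main__":
    now = 10_000.0
    beats = [
        # Hypothetical components; real ones would report via a push or a probe.
        Heartbeat("metrics-shipper", last_seen=now - 30, max_silence=120),
        Heartbeat("alert-pipeline", last_seen=now - 5400, max_silence=900),
    ]
    for name in stale_heartbeats(beats, now):
        print(f"no heartbeat from {name}; alerts may not be flowing")
```

The point of the design is that silence itself is the signal, so the check has to run somewhere that does not share a failure mode with the thing being watched.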

30:22

I mean, that's exactly what I don't want to have to think about. Like, I don't want to have to think about my vendor being down in some way that requires me to monitor them. Like, I feel like, you know, if I'm paying money out for that, I should be getting that by default. I don't know, maybe. I think that's just my pessimistic brain.

30:37

Yeah, for sure, everything will break. Everything will break, So whether you're building it or not, it's going to break and you either want to know about it or you know, if you think something doesn't break, you're

30:49

just not measuring it. Yeah, that's kind of mine. Yeah, I have two examples where I think a lot of effort has gone into that. Coralogix is a logging and monitoring platform, and they have anomaly detection, so if you have a service that normally spits out, you know, one thousand log entries per hour, if it stops, it identifies that as an anomaly and will trigger on that and say, hey, nothing has, like, technically alerted, but something's changed here.
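To make the log-volume example concrete, here is a minimal Python sketch of one simple way to flag a service whose hourly log count departs from its recent baseline. It is an assumption-laden illustration, not how Coralogix or any other vendor actually does anomaly detection: it just compares the current count to the mean and sample standard deviation of recent history. The function name, the sample counts, and the threshold of three standard deviations are all hypothetical.

```python
from statistics import mean, stdev


def is_volume_anomaly(history, current, threshold=3.0):
    """Flag `current` if it is more than `threshold` sample standard
    deviations away from the mean of `history`."""
    if len(history) < 2:
        return False  # not enough data to form a baseline
    baseline = mean(history)
    spread = stdev(history)
    if spread == 0:
        # Perfectly flat history: any change at all is a departure.
        return current != baseline
    return abs(current - baseline) / spread > threshold


if __name__ == "__main__":
    # Hypothetical hourly log counts for a service that logs ~1000 lines/hour.
    hourly_log_counts = [980, 1010, 995, 1020, 1005, 990]
    print(is_volume_anomaly(hourly_log_counts, 1000))  # an ordinary hour
    print(is_volume_anomaly(hourly_log_counts, 0))     # the service went quiet
```

A check like this is better suited to a report or a debugging aid than to a page, since a change in volume is a hint about where to look rather than proof that something is broken.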

31:25

And then another one that we just recently started using is a tool for monitoring our infrastructure spend called CloudZero, and it has really cool anomaly detection as well. It says, hey, this project, their current spend rate has been this, but today it changed to this, which is, you know, not necessarily a problem, but it's cool that it acknowledges that and you're like,

31:49

yeah, why did that change? I think correlation analysis like that is really good for, not necessarily paging someone at two a.m., but reports and debugging. Like, hey, I have a problem that started two days ago, what else changed two days ago? Right? Hey, you also started spending less money on this. It's like, yeah, correlation doesn't mean causation, but I'm certainly going to look there first. No, I think people underestimate that, really, though,

32:15

like just looking at, when did the problem start? What else happened then? And I feel like it's so obvious to say that, but it's usually one of the first things that's missed. Yeah, for sure. It's been like a pet project of mine for a very long time. I have a

32:28

super stale, busted repo trying to do this. Back when, uh, remember CEP, complex event processing? It was all the hype, and, like, maybe it wasn't all the hype, but it was somewhat popular in the late two thousands. I was trying to do the, uh, Holt-Winters forecasting. Like the thing that's built in, you know, RRDtool added that thing, again, ages ago, dating myself. You could, you know, put the prediction line in front of your graph. And I always thought, like, yeah, let's just do

32:54

that, like let's predict every time series. Like, it may not be a great signal, and I'm not going to look at it unless something's broken, but hey, I've got twenty thousand metrics. I'm not going to look at all of them, right? So when the site starts getting slow, it'd be really interesting if I could also see, hey, iowait time on the database server went up. It's like, okay, are we sending queries? Did a disk go bad? Oh, hey, look, errors on this disk spindle started

33:16

going up too. Like, maybe there's something here. And I think it solves another monitoring mistake that people still make, which is that people monitor root causes and not what they actually want out of a system, right? Like sending me an alert that my CPU is high at two in the morning. Like, cool, we're using the CPU we paid for, thumbs up, back to bed, right? Awesome capacity planning. But if it's making the site slow, page me and say the site is slow, right? Don't

33:45

page me and say Java is doing GC pauses. Well, yeah, you're running Java. That's what it does all day. If they're too long, you're going to fail a latency monitor and page me on the thing that matters. So, like, I want these root causes to be in the correlation engine to kind of help me figure it out. Okay, well, eighty thousand things can go wrong, and if you put an alert on all of those, you're back at that

34:05

team that notices you don't get paged for ninety minutes. Right. Yeah, if you look at every team that has page fatigue, this is exactly the problem. Yeah, most people don't even know why they get paged for ninety percent of the things. It's organically grown over the life of the team. Hey, this one time, you know, the database backups did this thing terribly, and now we page on it. And then you have so many of those, it's like, well, there's eight thousand alerts and most of them

34:30

resolve themselves, and we just shrug our shoulders. And I don't think it starts from a good place. Like, I think there's this idea that just having

34:37

dashboards is valuable in itself, and it sort of propagates from there. It's like, oh, you know, what are all the metrics we could be collecting, and let's show those. And sometimes there is an organization that thinks that there is some inherent value, like, oh, this is how many requests we're getting, let's use it to get more headcount because we have to support

34:55

such complex things. Or I remember a previous company that thought it was really cool to have a geolocation dashboard of where things were happening all over the world related to their thing. And I'm like, you know what, I didn't need to drive that with real data. I could have just randomly pinged a spot on the world on a flat, you know, two dimensional map and been like, look, it's happening. I didn't need to know the lights coming

35:15

up all over Yeah, for sure. It's just so totally unnecessary. And so I've been on this maybe personal vendetta here to make them be actionable. So you know, what is the business impact? But more than that, you know what will you do with that information? Is the site being slow? Will you actually take an action here to make it faster? Or is there like a run book or something that we can go and actually execute on. Well that's another interesting part of it all is like the term run book

35:42

has been so ruined by so many places. Like, if your run book is log into the server and run the script, like, yeah, it's just not automated. And, like, I think our run book should be, these are the places you should look. Like, if it's ever a thing we know that breaks, fix it. Like, yeah, I talked to someone once that was saying they were very proud, I mean, they built this great thing, and like, we had this service that crashes all the time, and

36:07

we built this great thing that automatically restarts it. I was like, okay, I see, right. How about collecting stack traces and sending them to developers and fixing the reason the service crashes? Like, yes, you should restart the service when it crashes, no argument. But, like, that is not the end of your journey. That is the first ten percent of your journey, right? You know, I love that because

36:30

it actually happened at one of the previous companies I was at. The name of the service was the Service Monitor Monitor, and it did actually do this. The root cause, though, and maybe you've got some infinite wisdom here, is they were using a library for math operations that had a memory leak in it, and so this would have required actually contacting a third party company to get their open source software actually fixed, which I think was under

36:57

a proprietary use license. So sometimes you have to do some ridiculous things. Yeah, I mean, I guess if you're in a world where you have to use a vendor library you can't fix, you may just have to, you know, live with the workarounds and acknowledge the terror. Yeah, no, for sure. The problem is when you start extending that to every problem that looks similar, rather than knowing that it's the right answer. Right, the nightly

37:23

Jenkins restart. Yeah, well, go ahead. I would say, like, you know, at some bank places I've worked, they were like, hey, we should just, you know, we're twenty four by five point five, right? You know, the Japan and US markets, and they're closed, you know, Friday night. Let's just reboot everything every Saturday. I'm like, why? Like, why not? And I was like, I think I asked the

37:51

question first. Like, yeah, we should reboot things when we upgrade the kernel, and we should use that time to apply upgrades. But just doing a bunch of stuff every Saturday because we can, because there's no services, it feels like we're just making busy work and, you know, finding things to break on a Saturday and ruin our weekends. Yeah, but those are also really hard arguments to argue against, even though you know there's

38:15

something fundamentally wrong. Yeah, we didn't do it, regardless. Yeah, someone, you know, there was a change set out to put a reboot in a cron job, and I was like, no, something is going to be terrible there, and I don't want to. That's How to Ruin a Weekend 101. Yeah, totally. And it was already like that because, I don't know, I think twenty four by seven is a way better environment to have fundamentally good operations than twenty four by five point five.

38:47

There's just too many bad habits in twenty four by five point five. Oh, we can always restart this by turning off the database, or turning off the clients and running an ALTER. Eventually you're going to have to run an ALTER midweek during trading. Yeah, you're not gonna like it, but it's gonna happen, and you're gonna have to take an outage to do

39:02

it because you haven't figured out how to do it. I mean, there's two obvious failure modes from that, which are, well, what happens if something changes on the restart, like, you know, a new upgrade or something, right? Now you're triggering that at an unpredictable time. Or just straight database replication crashing and losing out on whatever was in the journal at that moment,

39:22

that's not the thing you want to actually have happen. Yeah, do it in your lab to figure out how to deal with it, but maybe not prod. Yeah. So you've mentioned Jenkins and Gerrit, and so I feel like I'm picking up on a trend here that you're running a lot of services in house, whereas other companies may choose to use SaaS providers for those. Do you have a particular opinion on that? Well, I mean, I feel like my whole career, I've been at all angles of the build versus buy debate.

39:54

It was funny, at some of the banks I've worked at, or really any of the big orgs, you see, like, the historical CIOs, CTOs, whoever they are, one will get hired and they're like, we have to build everything. Then, you know, you have ten years of this, right, and then you look back and you have all these disparate things of, like, oh, this must have been built during the build era of two thousand and nine.

40:17

I mean, I think the answer is, really, it depends on the staff and what their expertise is. And should you run Jenkins in house? Well, can you run Jenkins in house? Right? Like, do you have, are

40:29

you going to dedicate the resources to do it? Right? But just outsourcing something isn't always as easy as just paying somebody, right? As we mentioned before, it's porting stuff there, it's operationalizing it, and actually, like, taking advantage of a SaaS service to provide value is sometimes just as hard. The challenge isn't running the infrastructure, it's using the infrastructure effectively. Right? Having Jenkins doesn't help. Having jobs that do meaningful things is important, and

40:58

you have that problem in any CI infrastructure. So I think for some of the things it doesn't matter. You know, observability, right, it's really hard to run high availability observability. It's also a mouthful to say. So, I mean, like, I'm a fan of, I mean, I work at an observability company, but I mean, I'm a fan of outsourcing some of that stuff. I think there's a certain scale where you run it internally. I think very few people are at

41:22

that scale. Yeah, very few people are actually running four nines proper. Like, I don't know if it's a triple digit number, but it's a small number of companies, and actually a lot of people think they're doing it, but not a lot of people actually do it, right? It's interesting. I mean, we're five nines on our core competency service, which

41:42

is like identifying stuff, but yeah, it's huge. We're not running the monitoring observability stuff ourselves. Like, we're using our cloud provider, or actually we're still in the process of trying to find a vendor that actually works with us. And I think that's going to fuel my next question. I'm curious whether or not you see companies get the build versus

42:00

buy decision right. I know that they put a lot of effort into comparing vendors, but then I feel like there's this gap on actually being able to correctly identify what the total cost of ownership is if they do actually go and build or run something themselves. You know, that is a tough one, right? TCO is so hard because it's easy to do the, this is my AWS bill, this is the vendor bill. How do you quantify the outages and the people and the stress and the upgrades? And

42:30

yeah, I haven't seen a good answer for that. I mean, of course, you know, any vendor will try to do that for you because they of course want to help. Yeah, good, but that one is tough, I mean, seeing it done right. I think the most common mistake people make when they're doing the evaluation is, writing a requirements doc is surprisingly hard. Yeah, like, write a requirements doc for CI and don't use a single product name, right? Right. People say, like, my requirement is

42:58

an Envoy proxy. It's like, there's absolutely no way. If your requirement is an Envoy proxy, what have you done? Right? If your requirement is a thing that speaks xDS, maybe Envoy's the only answer. But write down the thing you want, the outcome you want. This is the same root cause versus slow thing. Like, it's the XY problem as well. I don't know if, yeah, yeah, people say, I want to do this thing. It's like, what are you really trying to accomplish? And

43:22

then let's figure out, this is what we could build and accomplish that, this is what we can buy and accomplish that or not accomplish that, and maybe the decision gets more obvious. But it's just so hard to frame the thing you want to do in terms that are completely agnostic to the tool. You get to be the bad guy. You have to say, well, you were going to do an easy job, which is just pick a tool, uh, and now you're forcing them to go back to the drawing

43:45

board, like, well, why, you know, really, really look, right? Why do you want your builds to succeed? Yeah, exactly. But it's a good question. Like, in the CI case, like, why do you want your builds to succeed? Because I want to merge code faster and ship code to production faster. Okay, so your metric is time from commit to development, time from commit to production. Okay, we came up with really great metrics from

44:07

asking why. I think it usually seems kind of, like, I don't know, weird at first, and, you know, contrived, but I think it actually leads to, like, oh, okay, yeah, that makes sense. No, I'm totally, yeah, it makes me think that there's like some Freudian stuff here where I just want to sit here with a pipe and go, but why do you want your build to succeed? Well, is it because of unresolved issues with your mother that you feel your build must succeed?

44:38

I mean when you pull individuals into an organization, their personal values do impact what that organization drives as important and sometimes I've known software engineers to be uh, quite illogical and driven by their emotional state to you know, it has to be like this, it's so much better, and you know, you joke, but there is something there. No, that's definitely I mean, I'm probably guilty of that. Yeah. So you work for a observability

45:13

company and you're observing your internal tools. Do you use your own product for that? We do. Yeah, yeah, trying to dogfood all the time, and, say, we measure SLOs internally. One cool thing we've done is, we have a bunch of, you know, it's not great, like, you know, some shell script stuff that maybe shouldn't be shell scripts, but everyone's got

45:36

a pile of that somewhere, and it's pretty important in the normal workflow. And it was having a lot of problems, and, uh, you know, we just found a better way to measure, like, what is the success of people that say, I want to run a local cluster? How often does that fail? Turns out it was failing a lot more than we thought. And then debugging it is really hard, because you're like, okay, well, could you go

46:00

put a set -x in this file and run it again? And no one wants to, like, no one wants to do that, right? People just want to say, hey, this thing's broken, can't you just fix it? Right? So we built all of this in, even into our shell, right? So, like, we have like a tracing view.

46:15

Think of microservice tracing. Like, there's a request ID and, you know, it hits ten services. Like, you run this one shell script that actually is running, like, you know, ninety three shell scripts underneath it, for better or worse, or worse, but it's happening, and we have like a trace graph of, you ran this. And, like, so the people come and say, man, it took nine minutes to start this thing and it normally takes six minutes.

46:38

This sucks. Why is this? So we can go in, look up their user name, find the nine minute run, click it, look at the trace, and go, oh, yeah, there was this bug pulling from ECR or whatever it is, right? And just doing that helped us pin it down. It turns out all these problems with the tool were, like, systemic to one or two things being wrong, you know, poor assumptions being made, and we fixed those and we're kind of off to the races on it. And now we're building a prober for it, so we'll have like

47:08

a graph. You know, again, everything has a graph, right? So the same way we have an SLO for ingesting an observation in a certain amount of time, we'll have an SLO for, hey, this thing can create an environment locally and it happens in less than this amount of time. And then we'll have variants of it, like, does it work for people that run

47:24

it over and over again on their box? What about a new engineer that logs into a fresh box and runs it? Because that's always different, you know, that's always where the goblins live in these things. Well, my Terraform works fine. Oh man, I destroyed it, have to run it from scratch. You know, does that work? So kind of trying to measure all those different things from that using our tool.
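A prober like the one described above can be sketched as "run the command, time it, and count the run as good only if it exits cleanly within the time budget." This is a minimal sketch and not the tooling discussed in the episode. The names `probe` and `sli`, and the 360 second budget (taken from the "normally takes six minutes" remark), are assumptions for illustration, and the command shown is a stand-in for the real environment setup script.

```python
import subprocess
import sys
import time


def probe(command, max_seconds):
    """Run command once; return (good, elapsed_seconds).

    A run is good when it exits with status 0 and finishes within max_seconds.
    """
    start = time.monotonic()
    result = subprocess.run(command, capture_output=True)
    elapsed = time.monotonic() - start
    good = result.returncode == 0 and elapsed <= max_seconds
    return good, elapsed


def sli(results):
    """Fraction of probe runs that were good; results is a list of booleans."""
    if not results:
        raise ValueError("need at least one probe result")
    return sum(results) / len(results)


if __name__ == "__main__":
    # Stand-in command; a real prober would invoke the local cluster script.
    good, elapsed = probe([sys.executable, "-c", "print('cluster up')"], max_seconds=360)
    print(f"good={good} elapsed={elapsed:.2f}s")
```

Running the same probe from a long-lived box and from a freshly provisioned one gives the two variants mentioned above, and the resulting `sli` value can be compared against whatever objective the team agrees on.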

47:46

Is there like an eBPF integration here that you plan on utilizing in the future to understand what the requests are or how the script is running fundamentally on the machine? Yeah, I think we're looking at that, and I haven't dealt with it too much myself, but I think that that is kind of the ultimate for this. I would, I think, marry the two, right? eBPF for the raw, like, just show me everything that's running and help me, and

48:08

then my injected data. So yeah, I think that is in the future. For your internal tools, like, once you identify them, you're like, okay, we need to treat it like it's part of our core infrastructure. How do you socialize that across engineering so that everyone knows that this is a supported tool, this is a preferred path, if you were thinking about going and building something on your own for this one use case, we already have you covered here?

48:42

How do you communicate that? Email to the whole list? Pinned Slack message?

48:49

What could go wrong, right? At-channel, maybe? You can write documentation all day, but, like, I think we all know how much our documentation gets read, right? I think the way to do it is, this, you know, goes to the layer eight problem we were talking about earlier, but, like, you have to build a good relationship with all these teams and have them come to you sooner in the process. Like, the sooner infrastructure and SRE folks can be involved in anything, like, before code is written

49:17

would be super ideal, versus having them come to you and say, hey, I wrote this really cool thing, but it uses MongoDB. It's like, we don't do that here, right? Like, maybe there's a great reason that it does that, but, you know, it's much harder once they've written code, right? Like, you know, it's the poker pot-committed thing. Like, well, I called for the flush draw on the flop, I'm putting all my money in on the turn no matter what. It's like, let's do the

49:37

math. Not great. Get to them, you know, pre-flop, right? Like, should you even be in this pot with MongoDB? Right? Maybe. Yeah, but that's not easy to do, right? There's not a technical answer there. That's just build relationships, make your team available, have interactions people like, you know.

50:00

I think it was back to the sort of product management thing earlier that we were talking about, where if you are at the point where you need to tell people about the thing that you're working on, like maybe you didn't approach the situation necessarily in the best way rather than driving it from how you would a startup, which is okay, you know, what do our customers pain points look like? And you know, as users, what do they want? And they're coming to us and saying, hey, when are you

50:22

done with this thing that we asked for? And the implementation details are of course what you're picking because you know that best. But fundamentally it's just a matter of pinging them on whatever RSS feed that they're looking at. Yeah. I worked at a large company before and we had an infrastructure PM and it was amazing. It was just so it felt like it was easy mode, right, Like someone else is going to go gather requirements I don't have.

50:49

Yeah, I'll just go do some work. That's fine. And then they show up, you know, prioritized on a list. It's like, well, this is awesome, this is what it's like on the other side, right? You just need to plug that into ChatGPT and get the answer out as well, and then you can just, you know, stop doing all the work altogether. What do my engineers want? That's interesting. Get a Slack bot that uses ChatGPT to act as a PM role.

51:16

I was just the other way around. Have the pms just answer, like answer the question of what they want to have built, and then it will automatically build it for them. This is what the hive mind Internet wants to build. Probably not so bad if everyone wants to. I don't think you'll get to five nines. I think it's more like five two's will be in the front in there somewhere if you carry it out to enough decimal places. There's some nines. You know, non repeating imager and doesn't really good

51:54

Cool. So what else should we be thinking about for internal tools? What's your big takeaway piece of advice? Apply all the same rigor. Like, you know, when prod breaks, you go and declare an incident using your incident management tool. You have a communications role, you have the whole thing, or, hopefully, if you don't, you should. And then when you're done, you write a post mortem. You might

52:21

even publish it to customers. Publish it internally, right? Like, why shouldn't an engineer be able to read about why Jenkins broke? If production breaks, you're going to file a bunch of follow ups. You're going to prioritize them over other work because we don't want prod to break again. Same thing for internal tools. Like, I think that's the big takeaway, there's really not much of a difference. I mean, people say, well, if prod

52:40

is down, our customers that pay us can't do work. Okay, great, if CI is down, how are you going to ship a fix when prod is down? Right? Like, you're like, oh, I wrote this really great infrastructure where changes can only go through CI and not be hand-made. That's a great thing. But if your CI isn't as available as production, three nines CI, four nines production, and you can only make changes through CI. You know, not a math major. But we're

53:07

going to have a problem here forty minutes out of the year. You know it's going to be an issue. So I think people just underestimate how important that stuff really is and the impact it can have. You don't have to wait till it. You know, the worst case is production is having a problem. Your monitoring is broken and you can't

53:28

see it. Your CI is broken and you can't ship a fix for it, whether that fix is config or code. I mean, and then that's a really terrible postmortem to have to send out to a customer of, well, we knew what was wrong, and we had the fix, but we couldn't build the Docker image that had the fix because, you know, our two thousand and four era Jenkins decided to crash and not start. Yeah, that's not a good look for anybody. I mean, you

53:53

identified it. If there's value in doing this activity for some of your services because of what the users look like, then there's probably value in doing it for other services that just happen to be internal. I think that's a big part of post mortems, right, Like if you find a production problem where oh, hey, we had this bug in our database connection pool and we had this reconnection issue and this thing happened. Where else do you have connection

54:14

pools? Right? That should be the logical question. You ask the same thing internally. Right, we had this neglected service. What else is flying under the radar that's going to bite us? And those are the tough, you know, some of the post mortem actions are like, fix this bug with a reproducible case. Okay, that's like a day of work, right? You know when you're done because the test passes, it's easy. These are much more ominous, like, project-sized follow ups. But if you

54:38

don't do them, you're going to pay the price. Is there like some obvious pitfall that a lot of companies or maybe even everyone seems to get wrong in this area, sort of besides the stuff we've been talking about? Well, in post mortems, people are extraordinarily bad at distinguishing between root causes and triggers. Right? Pete typed this command and took the site down. Right? Root

55:04

cause is not Pete sucks, right? That's right. Like, the root cause is, why was there a command to, where are the seatbelts? Where are the gates? Where is the code review? Where's all that stuff? I think the five whys thing is interesting. I don't think you actually have to write down why five times in a document and fill it out. I think that's a little bit, you know, demeaning and not great. But the philosophy of, like, really come to the root cause

55:30

of the problem, right? Like, root causes aren't, well, AWS had an availability zone die. Like, what can we do? You can run in more than one availability zone. You can do this, you can do that, right? And maybe you choose not to at this point, but you should at least identify it and say, okay, like, we know the root cause, and we've chosen that this is a risk, and this is why we're

55:53

a three nine service or a four nine service. And maybe someday will make it better, maybe we won't, but at least being honest with yourself about it. Ah, that's huge right there. I want to highlight that because like, just because you identify the root cause doesn't mean you have to do anything about it. Because I've seen multiple instances where companies build infrastructure that is

56:16

far beyond their budget and their actual requirements because they're focused on that. And the analogy I like to use is, like, whenever I go to work every day, the fastest, most efficient way for me to get there is buying my own jet copter, but my budget really says I should stick with my eighty seven Toyota Corolla, you know, and so you have to, like, balance those two things, right? Right. Maybe the fix is leave ten minutes earlier

56:49

instead. Yeah, exactly. Yeah. Well, I think it also comes back to SLOs, of like, okay, we had this big problem, but, A, was it a big problem? You know? Like, it's hard, back to the emotion thing, right? Like, if a big customer is impacted, it gets a lot more priority, but you have to quantify at the end of the day, like, hey, this is how many nines we have.

57:09

This is our error budget. We used it during this incident. Maybe that's not good, but this is par for the course, and we don't need to suddenly become multi cloud, multi region, you know, all the things, load balancing, complexity. Because that's the other thing, you know, you can add nines with complexity, but complexity can also reduce nines. And you have to be careful about overengineering in response to things.
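The arithmetic behind "how many nines" and "error budget" can be sketched in a few lines. The sketch below is illustrative only. The function names are made up for this example, and the second function assumes that components fail independently, which real systems often do not.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600, ignoring leap years


def error_budget_minutes(slo, period_minutes=MINUTES_PER_YEAR):
    """Downtime allowed in the period for an availability target such as 0.999."""
    return (1 - slo) * period_minutes


def serial_availability(*components):
    """Availability of a chain in which every component must be up.

    Assumes independent failures, which is a simplification.
    """
    total = 1.0
    for availability in components:
        total *= availability
    return total


if __name__ == "__main__":
    for label, slo in [("three nines", 0.999), ("four nines", 0.9999), ("five nines", 0.99999)]:
        print(f"{label}: {error_budget_minutes(slo):.1f} minutes of budget per year")
    # Shipping a fix needs both CI (three nines) and production (four nines).
    print(f"CI plus prod chain: {serial_availability(0.999, 0.9999):.5f}")
```

Under these assumptions three nines allows roughly 8.8 hours of downtime a year and four nines about 53 minutes, and a path that needs both a three nines CI system and a four nines production system lands slightly below three nines overall. That is one way to see both the earlier CI point and the remark that added components can reduce nines.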

57:35

And I hate to use the term, like, you know, this is always the escalator. Well, if there's an act of God that, you know, this managed service goes away. It's like, well, what's an act of God? Is that an AZ going down? I don't know, that's just expected, right? Is it a tornado hit the Virginia area and all of us-east-1 went away? Okay, there's an act of God we don't have to

57:53

plan for. But if you're trying to be a five nine service, you're not running in one region anyway. So, you know, it kind of all has to make sense together. You know, the whole story has to just kind of flow. And that's where I think some people get off. They write an SLO, but they don't have a story to back it, or they write a really complex story that they don't need for an SLO that's simpler. Yeah, good point, for sure. But it's an art,

58:22

so it's, you know, there's no right or wrong answer. I always say, everything is so technical, and SLOs are this, like, wafty, what is right, what is wrong, who really knows? You just kind of have to do it. People always asked, when we were rolling out SLOs at a company, how do I know my SLO's right? I was like, well, you know, it's not. It's your first SLO. So you build it and you have post mortems and you adjust it as you go. You know, did you have an outage? Yes? Did your

58:51

SLO show it? No? Cool, make it more aggressive. Did you not have an outage, and your SLO says you had an outage? Make it less aggressive. Right? Like, it sounds simple, but that's just the feedback loop. And if you're doing it twelve months later, you probably have a pretty decent setup and have a much better idea of what your customers consider an outage or not. You hit on something really interesting there, actually. So if SLA is your, you know, contracted amount, and the I is whatever

59:17

your indicator is, the O is always just an objective. So it does seem like it must be subjective in every way. How do you sort of pick that?

59:25

How do you know what? First off, if you have an A, the O must be more aggressive, let's hope, right? Right. But let's assume for a moment you don't have a contractual, you know, I say, you know, SLAs are like SLOs with lawyers, right? It's really kind of a, yeah, you have to guess. Like, I mean, I think that's the unfortunate part of it. And that's where, you know, hire somebody with experience. They'll be able to guess more accurately, maybe, but they'll at

59:52

least also know when they're wrong. So, I don't know. It's a process where you just have to iterate, and sometimes you have a major miss. You're like, oh man, we really thought we were measuring this service, and we had this massive outage, and our dashboard was like, everything's green, nothing's wrong, and users are angry. It's like, okay, well, we missed this super critical part of the picture, and then you do the postmortem thing and you figure

01:00:15

out, okay, well, I made this common mistake. What other of my SLOs have this common, you know, did I make this mistake more than once? Would it be fair to say that it should be meaningful, so that if you're violating it, then you're taking some action as a result, and if you're not, then you don't do anything? So maybe it's about finding that sweet spot where it causes the right thing to happen in your organization. Absolutely. I mean, I think to get there, right, you need

01:00:42

to have some kind of alerting and reporting around it. I think alerting on SLOs is like a very hard problem. Like, the Google book will talk about burn rate alerting. Have fun implementing that, right? That's very hard.
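The core arithmetic of burn rate alerting is small, even if operating it well is not. A burn rate is the observed error rate divided by the error rate the SLO allows, so 1.0 means the budget is being spent exactly on schedule. The sketch below is an approximation for illustration and not a full implementation. The function names are made up here, and the 14.4 threshold is an assumed example value: at that rate, one hour consumes 2% of a 30 day budget (14.4 × 1 / 720 = 0.02).

```python
def burn_rate(bad_events, total_events, slo):
    """How fast the error budget is being spent; 1.0 means exactly on budget."""
    if total_events == 0:
        return 0.0  # no traffic in the window, so nothing was burned
    error_rate = bad_events / total_events
    allowed_error_rate = 1 - slo
    return error_rate / allowed_error_rate


def should_page(long_window_rate, short_window_rate, threshold=14.4):
    """Page only when both a long and a short window exceed the threshold.

    The short window keeps the alert from firing long after the problem ended.
    """
    return long_window_rate >= threshold and short_window_rate >= threshold


if __name__ == "__main__":
    # Hypothetical counts: 20 failed requests out of 1000 against a 99.9% SLO.
    hour = burn_rate(bad_events=20, total_events=1000, slo=0.999)
    five_minutes = burn_rate(bad_events=3, total_events=100, slo=0.999)
    print(f"1h burn rate {hour:.1f}, 5m burn rate {five_minutes:.1f}")
    print("page" if should_page(hour, five_minutes) else "no page")
```

The hard parts alluded to above sit outside this arithmetic: getting trustworthy good and bad event counts per window, choosing windows and thresholds, and keeping the alert tied to something users actually feel.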

01:00:52

But if you can get there or have some approximation of it and report on it, like, I'm a big fan of getting, you know, it doesn't work right away, but eventually maturing to a point where every alert has a ticket, right? And the ticket is either, fix the thing that caused the alert, you know, this needs to be more resilient, this needs more replicas, or

01:01:13

this alert paged me and nothing was actually wrong, we should fix the alert. And if you do that over time, like, that's how you can get SLOs to cause change, and when they really are broken, the postmortem loop is the real fix. Like the trick with SLOs is you kind of, I don't know. There's that perpetual battle with infrastructure and product, right? You're like, hey, we need you guys to write more stable code, and product's like, we need features, we need this to be a different color, right,

01:01:36

which is fine, like, there's a balance there. But if you get everyone to agree on SLOs and you're really user focused, like, hey, users are happy when the latency, you know, p99 latency is this, and users are sad when it's over that, and everyone,

01:01:50

like business, engineering, management, agrees on that, your argument is a lot easier when you have an outage to say, our users weren't happy. Like objectively, our users weren't happy. So we either fix the thing or we decide that our threshold was incorrect on what is a happy user. But one of

01:02:07

the two has to happen. We can't do nothing. We can't just write features and say, well, users might be unhappy again, because we know this thing will blow up. And I find that that is a good, you know, you almost trick product into it, and they're like, oh yeah, SLOs, these are great, and then later it's like, oh, yeah, I guess we have to fix this now, so, you know, we'll give you a sprint or, you know, whatever it is. You know,

01:02:30

I did the same trick a long time ago with OKRs. Very similar thing, where, you know, once they agree to the OKRs, you try to set the mindset up: what does this actually mean? You know, why are we setting it? And you set them so we know what the right thing to do is, and then later you can just point back to it and be like, hey, you know, we decided what the right thing was going to be in this situation. Now it's time to execute on it. Or, you know, you have to

01:02:53

make a trade off, as you said. Yeah, it's tough. You know, you can't solve people problems with tech, but you try. That'll be the million dollar, you know, thing. But I don't know,

01:03:10

I feel like data is always helpful. You know, emotions are bad, you know, everyone has emotional arguments, but data is hard to, people still argue with data, but I feel like if you're on the side of data, you at least have a fighting chance of pushing for change. Or, how do you know, like, do you ever feel like you're in a situation where you get scared of survivorship bias, where even if you're doing root

01:03:37

cause analysis and you're finding really what the underlying problem is, that even though you're going on and fixing it, you're missing some other side of the iceberg that is waiting out there to come and crush you? Totally. But I mean, I don't know. I just accepted it, right? It's a pessimistic SRE brain. Like, everything will break. Everything I fix will also break. When I leave a company, everybody will blame me when it breaks.

01:04:03

And that's whatever, you know. Like, I've internalized it and accepted it. And it used to stress me out a lot more, and now it's just like, yep, this thing I'm rolling out might break, and, you know, at least you know that if it breaks, you're going to write a good postmortem and learn from it. And that's kind of the consolation prize of, like, yeah, it sucks to write a big postmortem after a

01:04:25

big outage. But also, you know, that's the gig. And like, we shouldn't have hired Pete, that's the R. Well, you know, like the first place I ever worked was FedEx, right? There was this meeting every week, the R squared meeting, the Redundancy and Reliability meeting. Interesting. And it was like, how good of a week is the VP having? Right? And there was always this rumor that, like, oh, somebody was fired once at this meeting because they had an outage, and like, no one can

01:04:50

actually tell you that person's name, what year it was. I'm pretty sure it was, you know, it was trying to hype you up to prepare for it. I think it was. I don't know if they intentionally did this or it was just a grow but like, that's such a terrible, you know. I think the blameless culture is big. People can't be a problem, right? But I think often it's not the person that made the change that is actually the problem. It is someone reviewing code, right? Like code

01:05:17

review should be first class, not checking a box. I feel really bad when a change I reviewed causes an outage. That happened recently. I was like, who approved this change? Oh, never mind, I should have done better due diligence. And yeah, I mean, it goes back even further than that though, because if you do think that it is one person's responsibility that caused the problem, you can look at, well, you know, what was the culture we had that allowed them to make the mistake?

01:05:43

Or, you know, why were they even hired? Right? Were they a good fit for the role in the first place? And you can definitely go back up somewhere else or a different chain to really dive in there. Some of those things you might not, right. It's like a meta postmortem of the, yeah, for sure, how do we get here? Right? How do we end up hiring people that understand networking for a networking product or whatever? You

01:06:01

know it? Right? Yeah, I think that stuff's important. And then change the process, right? Okay, these are the new criteria for this. For sure. Yeah. Yeah. When I was in the Navy, we had this process for getting your certification for whatever job you were doing that I've tried to bring into pull request reviews. It takes a cultural shift to get fully implemented. But in the Navy, it was set up so that, like, I was a nuclear engineer, so you had to learn all

01:06:30

these different skills to operate the power plant. And so you would go around the power plant and work with the existing engineers and show them that you knew how to do something, and if they felt like you understood it, they would sign it off in your book. And this was way back pre computer stuff, so you actually had a physical book that you carried around, but

01:06:55

you signed it off in that book. And then if at any point in your career you ever screwed that task up to a point where your skills were called into question, they would open up the book to see who signed it and go back to that person and say, hey, why did Will screw this up? And so it did put that sense of pressure on you, so that before you would sign off on anyone's book on anything, you wanted to make sure that you were reasonably confident

01:07:26

that they actually knew what they were doing. And so I treat pull requests the same way. Like, if I approve a pull request and it breaks something, I don't consider that to be a fault with the person who submitted the pull request. I consider it to be my fault for not catching it in the review. This is like the Erdős number corollary, the, uh, there's the, you know, you chase it back all the way up the chain, like, well, you know, who did that person, you know, what

01:07:51

does that person's book look like? The reviewer? Right? No, I think that's a good, good way to look at it, and it's just good for accountability and you know, yeah the meta issues. Yeah, and like a one off instance, you know, is not that big a deal. But over time, you know, if like everyone that I signed off on this particular skill is having problems, that's going to point back to the root cause being me not actually understanding either what the skill is or how to

01:08:33

evaluate that skill. Yeah. I've been places where the root cause has just been fatigue. It's like the on-call rotations are too insane. This was at the end of, this person covered someone else's on call. They were on two weeks of twenty four by seven on call. They had had a bunch of major incidents overnight, they were running on no sleep, and they just simply did the wrong thing. You can't fault the human for that. Why do

01:08:58

we put them in this grinder? Like every other industry, the airline industry is a great one, right, you have limits on how much you can fly and be responsible over people's lives. I mean, obviously when lives are at stake, like you said, you have to be more rigorous. But it doesn't have to be that rigorous for, you know, being on call. But, you know, we have an informal policy on our team: if we take overnight pages, someone will offer to cover the next day, the next

01:09:20

night so that person can catch up. Yeah, there's nothing worse than having a week of terror where you're losing sleep every night. Like, you're dead by then, and then a major incident. You know, the worst timing always happens in these things, right? Like the worst outages are never one

01:09:35

thing. It's a confluence of events. Right? So that terrible outage is going to come Friday when you're running on fumes and your brain isn't there, and you're going to make mistakes, you're not going to see problems, and

01:09:46

that's not your fault. But that's tough in a startup world. Like, oh yes, you're meant to grind, but there has to be some reasonableness of, okay, the people with responsibility for keeping the site up need to be aware and awake, because we're not just running runbooks where we copy and paste. The runbook is: use your brain. Yeah, I think we really realized over, at least, the most recent decade that the grind is not

01:10:14

helpful. Even like doing more hours of work, especially in a knowledge work industry, does not translate to additional value. And so if you do have incidents every day that people are on call, like, it seems fundamental that you would intentionally rotate them off so that someone else is there. Because, you know, I think there's like a pride issue here where the engineer just wants to stay on because, you know, it's their rotation and

01:10:40

they don't realize that it's actually harming the company. Like they should speak up for the benefit of the company. It's not about them necessarily. Right. Heroics shouldn't be, I mean, it's great, sometimes heroics just have to happen, and they happen and it's good, but that shouldn't be the goal of like, oh man, I want to be the hero of the Saturday. Please, no. Like it was really fun early in my career, and now it's

01:11:00

like, oh, there's heroics happening. This is awful. Yeah, it's the cult of the hero, because then you fulfill that hero role and then, you know, you get at-mentioned in Slack and, you know, everyone's like, oh wow, that was such a tough effort, you know. And the worst, yeah, the worst word to use in HR: rockstar. Yeah. Absolutely. I don't want to be a rock star at work, man. Like, please, no. If I'm going to be a rock star, I want it to be in Mötley Crüe, fueled by cocaine

01:11:30

and hookers. Let me be a rock star. Like, being a rock star in infrastructure is like, yeah, where's my hotel room to tear up? Yeah, I mean, at the beginning it was, oh, they described me as a rockstar, that must mean I'm doing great. And now it's like, watch your mouth. Yeah, don't call it that. But I think the culture is, yeah, I don't know. It's

01:11:54

a tough one. You don't want to actively discourage it. You don't want to say, log off and go home, like, if you want to work. I don't know. My thing is, like, I'm a workaholic, and I always say I'm happy to work long hours if it's what I want to do. If I want to work on Saturday, awesome. If other people want me to work on Saturday, that kind of falls apart for me. And the social engineering trick is that, well, just give Pete work that

01:12:16

he really likes and is passionate about, and he'll work on Saturday. That's fine. Okay, that's a good trick. It works on me, but I don't think people have used that. And some weekends I game all weekend. Some weekends I work all weekend. But usually it's my choice. The point about that, I think part of that is that we are creative in the work that we do, you know, using our creativity to

01:12:43

solve problems. And creativity doesn't show up at nine a.m. when you hit the time clock, you know, and if it's something that you're excited about, you get this, you know, that creative push, like, oh, I got to go do this, which is what leads you to go in and work on Saturday. Most of my best work is done, yeah, after midnight or on the weekends. Correct. Like, that's exactly it. You get the idea and you're like, man, you're laying in bed and

01:13:11

you're like, I can build it this way, that way. It's like past draft to build it right. Like, I know. What always fueled me was there was some really annoying problem, like some other problem I didn't want to have to solve. Like someone was asking ridiculous things on how to fix Jenkins, and the solution was something I didn't want them to do, and so I felt motivated to just have this problem completely go away. And that's when I would really work on things, like, nonstop. It's

01:13:39

nerd sniping, right? Someone says it can't be done: see you Monday. Here it is. But I think the other problem with that is it has to be clear on the team that that's happening. Like some people just don't ever work weekends. I think that's great. Like, you know, that shouldn't be a problem. So it shouldn't be a peer pressure of like, oh, Pete worked the weekend, everyone else should.

01:14:04

Like I think that is a terrible message to send, so you have to be careful with it. Like I've learned not to send too many emails. Well, luckily the company I work at is not an email company, but at big companies where email is life, I'm always very careful about sending emails on the weekend, because you're implicitly setting expectations for other people

01:14:24

that are watching, and I feel like that is trouble. Yeah, I mean that's another avenue, realistically. Like, I think there is a thing about, just like you want the load on your systems to be constant, you don't want to see spikes because they're incredibly hard to deal with, the same goes for teams that are putting out work. Right? If some engineers are incredibly spiky on load, then you're unpredictable in what you can deliver and how much,

01:14:50

and the reliability or quality of that. So it's you're not necessarily doing anyone a favor by one day going out and solving a problem. If that's your pattern. Yeah, I think it's a difficult lesson for a lot of people

01:15:02

to learn. I struggle with it a lot. I often find myself, like, I'll work a weekend because I want to, and then Monday, I'm like, oh, man, I don't want to work today. Like, yeah, well, that's fine, right, because then you're still having the same amount of work that you're sort of putting out and productivity for the team, but you're not overburdening them, because someone has to review that, right? That's still creating follow-up work down the road, and depending on the

01:15:27

statement, maybe that creates incidents as well. Yeah, no, you're right, it's a very tough balance to figure out. Yeah, I'm still trying to figure out what that is. I would say my whole career is pretty spiky overall. There are some places I work at where I'm always, you know, I can't help the workaholic in me. But there are some places where I chill a little more, and some places where I'm like eighty hour weeks, insanity. And I don't know. For me, I've learned it's a

01:15:51

healthy cycle. Like every couple jobs, I do the insane grind because that's just where I'm at, and then I have a couple of years of like less insane grind, kind of relax and find other things to do, and then right back to it because I miss it, you know. I mean, it doesn't matter what your pattern is. I mean, if you are

01:16:08

spiky, it's fine as long as it's somehow consistent. Right? You know, if every couple of weekends, you know, every other weekend, you do extra work, then you sort of expect that in the realm of things and how that's going to play out. But if it happens and it's unpredictable,

01:16:20

then you don't know what the impact is on the team. You may think that the team can have more work done than is reasonable, and so a new big project comes out and now it's taking even longer or unexpected because he's not pulling those weekends anymore and doing a real job that has a lot of value, but messed up some sort of prediction or timelines or deadlines. I think engineering time prediction is like the hard, I just throw that out

01:16:48

the window, especially with infra. Right? Well, it's two weeks of infra work. Oh, so it'll be done in two weeks? Maybe. Like, two weeks of infra work sometimes takes two months with outages and interrupts, and it's hard to explain sometimes. Yeah, it's like Einstein's theory of relativity. This is real the fast where we go this is different where we go no. But I think going back to, you're talking about working in spikes. I think that's how we've worked as humans for, you know,

01:17:24

forty thousand years. Like you go out and you do the big grind to, you know, hunt the animals, and then you go back and you just rest and you relax for a while. Or you go out and you work all summer to plant the crops and harvest the crops and then store them for the winter, and then you ride the winter out. So I think that behavior is actually something that has been native to us for a long, long time, and to try to break that in the course of a

01:17:54

three decade career is going to be difficult. I mean, you're onto something there, definitely. What you called it, art, right? So you look at the Renaissance artists, famous ones, like, you know, see what they did. What were other artists doing before? And even today, you know, how are they working? Because that's the same sort of expectation you can have for any knowledge work, which is very similar to a creative process. You have

01:18:18

to have the right motivation, and sometimes that's really hard to figure out. I don't know what that is sometimes, right? Some weekends, I'm just, I want to be at and stare at a screen. And some weeks I wake up early on a Saturday and I'm like, let's write some code, let's do stuff. And yeah, I have no idea what drives it. Sometimes I know, but often it's just, I don't know. It's hard to say how I'm feeling or will feel until I actually wake up. And that's

01:18:44

what to do for sure. I think if we figure that out and can define it and reproduce it, we've got our next multi billion dollar startup. Seriously. Awesome. Is there anything else we should talk about for internal platforms or infrastructure? I think we covered a lot. I think that, yeah, all the things I want to talk about. Awesome. Cool. Let's do some picks. Warren, I've been picking on you for picks the last couple of episodes, but I gave Pete the heads up before we started recording, so I

01:19:20

know he's prepped. So I'm gonna put Pete on the spot. What'd you bring for us, Pete? I'm a gamer, and I like, uh, I don't know. I like games that are really hard. I like games that are kind of like a second job, and grindy, and I don't know, I'm a glutton for punishment, I guess, is the summary. Some people play games just to have casual. I have some casual, you know,

01:19:44

hang out with the boys and play some games kind of thing. But I like spreadsheet on the second monitor and gaming on the first monitor kind of thing. So I really like ARPGs, you know, hack and slash, anything where you can min-max. It's just interesting, up my alley. A new game came out. I don't know, I think it went full release ten days ago. It's been around for years in, like, betas and alpha. It's called Last Epoch. It's like a, you know, Diablo-ish ARPG thing.

01:20:11

And there's Diablo. I played a lot of Diablo. There's PoE, which is like the ultimate. Like I don't have enough hours to, I'm also like a very addictive personality, so I have to choose my games carefully. I don't play Factorio because I know that I would just stop working. Like Last Epoch is this nice middle. It's a lot more complex than Diablo. There's a lot more things that can contribute to your final build that you can nerd out on, trying this and trying that. There's a lot of RNG, so there's

01:20:36

the grind aspect of it. It just checks all the boxes. And for me, I know a game is good when I lose track of time constantly. Oh it's a very am I should go to bed now. That's been like the last week and a half since the game came out. And so that is my pick. If you like ARPGs or min-maxing games or things like that, this game just checks all the boxes for me. Written by gamers. Sometimes you play a game when it's pretty clear that the people that

01:21:04

wrote it don't actually play the game they've written. It's actually way more common than you think. And you realize because there's no quality of life features and it's like, well, this is awkward and I have to do it every two raids and this sucks. This game is just like, oh, I have to do this thing. Oh wait, this is really easy to do. Like, okay, someone who plays the game wrote the game. So yeah, I enjoy a game like that. It's a smaller company. So is there an online

01:21:30

multiplayer mode for it? Yeah. So it's like a whole, when you go to the common areas there's a bunch of other people there, and there's trading and stuff, and there's, you know, public chat, which is half trolling, half people asking questions. But you can party up with your buddies and tackle

01:21:46

content together. And so I have a small group I play with and we've always played ARPGs, so this is our current one that we're all just kind of grinding up characters on and figuring out what the best builds are and what synergizes with each other and things like that. All right. So the follow up question is, do you want to share your gamer tag so that listeners of the show can jump on and talk trash? Yeah, I'm pdf backwards the TEP all right, someone out there has p F places. So I'm

01:22:15

a yeah, you said difficulty. I thought for sure you were going to bring up Lost Souls or Dark Souls, and so I that kind of game I if I was good at I would play. I don't think anyone's good at it the whole, like the the you know those kind of games where it's like jumping mechanics and stuff. I don't know, I'm that's stuff. He's not a platformer. That that's it? Yeah, I mean really twitchy

01:22:38

games. Yeah, I'm totally with you. I played Ninja Gaiden a lot in the past, and, like, there are some parts that were incredibly challenging. Last Epoch has enough of that. Like, some of the boss fights are legitimately challenging. You know, there was just one dude we fought probably forty times in a row before we beat him, but now we're really good at it because we, yeah, we postmortemed each death. Okay, you can't stand in the middle. That's really where games are worth it, right? Like

01:23:05

it's, I think that's a real lesson. You know, you really have to take root cause analysis and postmortems to your private life, and, you know, in your friend group when something goes wrong, you really need to investigate. I often joke that, like, many things about me are really good for work and really terrible for personal relationships, but they work out really well in gaming. So root-causing a relationship problem is usually a bad idea. Root-causing

01:23:30

why you died to a boss: apply it where you can. Fair enough. All right, Warren, what'd you bring this week? Yeah, so a couple weeks ago I mentioned this already, but I'm gonna plug it again. On Friday, there's the Decompiled conference in Dresden, Germany, uh, which I'm

01:23:51

actually giving a talk at about our journey at Authress and adding security. But there are some interesting talks there, like there's one that I really want to go to that's about migrating from Kubernetes to serverless, and I think there are a bunch of other ones that are really interesting. I'm looking forward to it. Nice, excellent, cool. So my pick is going to be no surprise to anyone who's been listening to the last few episodes. I am picking Platform

01:24:16

Con coming out in June. It's a five day virtual conference about platform engineering, so check it out totally free, and there's going to be tons of great talks there and specific to me. At the end of the conference, I will be doing a live Q and A session with some of the speakers, so they're finalizing who the speakers are going to be, and then once that's done, I am going to try and turn this into a Q and

01:24:44

a session that you actually want to listen to. So I'm going to go out on X and start asking people which speakers do you want me to interview and what questions do you want me to ask them, so that it becomes the interview that you want to hear, and yes, I know that by going out on X and asking you what questions to ask, some of you are going to ask some of the wrong questions. So I'm just gonna be honest here with you. I'm not going to ask that, but thank you

01:25:12

for listening to the show, you sick little pervert. And so whenever the Q and A session comes up, I will ask the questions, some of them, because we know what you're going to want to ask, but I will ask some of them and then you will be able to say, hey, I heard about this on the X, which takes it full circle because I did go through that whole setup just to walk all the way through and plug a ZZ Top song in my pick for today. So all of you listeners who

01:25:43

are ZZ Top fans out there, that was all done for you. Thanks for listening. Cool. So I think we got an episode here. Awesome. Yeah, Pete, thanks for joining us. This has been a great talk. Fun to have you on the show. Yeah, thank you for having me again. This was great. Yeah, anytime. Warren, thanks again for joining me as co-host here. Of course, I love having you on the show. It's been a lot of fun. Look forward to seeing you next week. Yeah, and for all you listeners out there,

01:26:15

and we will see you all next week too. Thanks everyone,

Transcript source: Provided by creator in RSS feed: download file
