
#59 - DevOps Solutions to Operations Anti-Patterns - Jeffery Smith

Oct 11, 2021 · 52 min · Ep. 59

Episode description

“DevOps is about creating a collaborative environment between the development team and the operations team, and aligning goals and incentives between those two teams. Because so many of the problems that we encounter in life, not just even in technology, are due to misalignment of goals.”

Jeffery Smith is the author of “Operations Anti-Patterns, DevOps Solutions” and the Director of Production Operations at Centro. In this episode, Jeffery described DevOps essentials and emphasized what DevOps is not. He also explained CAMS, a framework that outlines the core components required for a successful DevOps transformation. We then discussed three anti-patterns taken from his book: paternalist syndrome, alert fatigue, and wasting a perfectly good incident. He explained how to recognize those anti-patterns in order to avoid them on our DevOps journey. Finally, Jeffery talked about postmortems and shared tips on how to cultivate a good postmortem culture.

Listen out for:

  • Career Journey - [00:04:47]
  • DevOps - [00:09:13]
  • CAMS - [00:12:42]
  • Why DevOps Anti-Patterns - [00:16:48]
  • Anti-Pattern 1: Paternalist Syndrome - [00:19:55]
  • Anti-Pattern 2: Alert Fatigue - [00:27:20]
  • Anti-Pattern 3: Wasting a Perfectly Good Incident - [00:34:33]
  • Postmortem - [00:39:59]
  • 4 Tech Lead Wisdom - [00:45:57]

_____

Jeffery Smith’s Bio
Jeffery Smith has been in the technology industry for over 15 years, oscillating between management and individual contributor roles. Jeff currently serves as the Director of Production Operations for Centro, a media services and technology company headquartered in Chicago, Illinois. Before that, he served as the Manager of Site Reliability Engineering at Grubhub.

Jeff is passionate about DevOps transformations in organizations large and small, with a particular interest in the psychological aspects of problems in companies. He lives in Chicago with his wife Stephanie and their two kids Ella and Xander.


Our Sponsor

Are you looking for some cool new swag?
Tech Lead Journal now offers swag that you can purchase online.
Each item is printed on demand based on your preference, and will be delivered safely to you anywhere in the world where shipping is available.
Check out all the cool swag by visiting https://techleadjournal.dev/shop.


Like this episode?
Subscribe on your favorite podcast app and submit your feedback.
Follow @techleadjournal on LinkedIn, Twitter, and Instagram.
Pledge your support by becoming a patron.
For more info about the episode (including quotes and transcript), visit techleadjournal.dev/episodes/59.

Transcript

DevOps is a style of working. It's about creating a collaborative environment between the development team and the operations team, and aligning goals and incentives between those two teams. Because when you think about it, so many of the problems that we encounter in life, not just in technology, are due to a misalignment of goals.

Hey everyone. My name is Henry Suryawirawan, and you're listening to the Tech Lead Journal, the show where I'll be bringing you the greatest technical leaders, practitioners, and thought leaders in the industry to discuss their journeys, ideas, and practices that we all can learn and apply to build a highly performing technical team and to make an impact in your personal work. So let's dive into our journal.

Hello, everyone. This is Henry. Welcome to another episode of the Tech Lead Journal podcast. Thank you for tuning in and spending your time with me today, listening to this episode. If you haven't, please follow Tech Lead Journal on your podcast app and on social media: LinkedIn, Twitter, and Instagram. Also consider supporting the show by subscribing as a patron at techleadjournal.dev/patron, and support me to continue producing great content every week.

DevOps as a culture and practice is one of the most widely talked about topics in the current technology landscape. According to the State of DevOps report, elite and high DevOps-performing companies are leading their industries in terms of organizational performance and continue to outperform the companies that do not practice DevOps optimally. Some of the reasons companies do not practice DevOps optimally are misconceptions about the whole DevOps concept as a culture, and also some anti-patterns that may get adopted unconsciously.

To shed more light on DevOps culture and practices, for today's episode I'm happy to share my conversation with Jeffery Smith. Jeffery is the author of a book titled "Operations Anti-Patterns, DevOps Solutions", and he's also the Director of Production Operations at Centro. In this episode, Jeffery described DevOps essentials and, importantly, emphasized what DevOps is not. He also broke down and explained CAMS, C-A-M-S, a framework that outlines the core components required for a successful DevOps transformation. We then discussed three anti-patterns taken from his book: paternalist syndrome, alert fatigue, and wasting a perfectly good incident. Jeffery explained how to recognize those anti-patterns and how we can avoid them on our DevOps journey. And finally, Jeffery also talked about postmortems and shared some tips on how we can cultivate a postmortem culture.

This was such a fun conversation with Jeffery, and I really, really enjoyed it, especially discussing those three DevOps anti-patterns that I could personally relate to from my own experience. I believe that you will enjoy and learn a lot from this episode as well.

And if you do, consider helping the show by giving it a rating and review on your podcast app, or share some comments on the social media channels. Those reviews and comments are one of the best ways to help me get this podcast to reach more listeners, and hopefully they will also benefit from all the content in this podcast. So let's start our episode right after our short sponsor message.

Are you looking for a new cool swag? Tech Lead Journal now offers you some swags that you can purchase online. These swags are printed on demand based on your preference and will be delivered safely to you all over the world where shipping is available. Check out all the cool swags available by visiting techleadjournal.dev/shop, and don't forget to brag about it once you receive any of those swags.

Hey everyone, welcome back to another Tech Lead Journal podcast show. Today I have with me an author named Jeffery Smith. Jeffery is the author of the book called "Operations Anti-Patterns, DevOps Solutions". So I guess it's about using DevOps solutions to overcome those anti-patterns. Today we will be talking a lot about some of the anti-patterns for operations people, or in the DevOps world. We're going to be talking about some of the things that maybe you should try to avoid in your operations team, and how we can use some strategies that can improve how you manage your systems and how you operate your systems, so that using the DevOps solutions we can overcome those things. So thank you so much for spending your time with me today, Jeffery. I hope to have a good conversation with you today.

Awesome, Henry. Thanks for having me, and I'm looking forward to it.

So Jeffery, in the beginning, maybe you can introduce yourself, telling us more about your career, any highlights or turning points.

Sure. Yeah, so as you mentioned, my name is Jeff Smith. I'm currently the Director of Production Operations at a company called Centro in Chicago, Illinois, here in the United States. We are an ad tech platform, which didn't sound like the most exciting thing to me when I originally took the job. Ad tech is like the bane of the internet, but in reality, it is what fuels the internet and keeps so much of it free. I'll talk a little bit more about that later. Prior to that, I was with a company called Grubhub, and they're in the food delivery space. My time at Grubhub was really when I first got introduced to DevOps and its concepts.

It was the first place I could really sink my teeth into it and experiment. My career really started twenty-odd years ago back in upstate New York, where I grew up. I had been doing data entry at a local health insurance provider in the area. I'd always been interested in computers, but I was a terrible high school student. A gentleman who was the manager of operations, and who would later become my mentor, walked by and saw me reading Richard Stevens's "TCP/IP Illustrated" book.

I don't know if you know that book, but it's like the networking bible. What he didn't know at the time was that the book was way over my head. I was reading it and I was getting some of it, but it was pretty dense stuff. He saw it and reached out, like, "Hey, what are you reading there?" We talked a little bit about it, and we would have conversations every now and then. Eventually an operations position opened up on his team, and he said, "Hey, would you like to switch careers? Stop doing data entry and do some computer stuff?" So I was like, "Yeah, sure." That's really what kicked off this 20-year romance with tech, I'll say.

I was at that company for about 10 years. I worked my way up to operations manager, eventually succeeding him. You know, it's one of those moments where you're like, "Wow, is every place as messed up as this one? Maybe I need to branch out and see what else is out there." My wife, well, girlfriend at the time, now wife, said, "Why don't we open up the search? Why don't we look anywhere instead of just looking in our local area?" So that's what brought us to Chicago.

I got a job in Chicago and switched to a couple of different companies. I had some success doing performance tuning for a company called Accenture in one of their lab environments, but that was right during the start of the financial downturn. I'd been brought on as a contract-to-hire, and as soon as I got brought on, the financial crisis hit, so they weren't really hiring, and they kept extending me and extending me. Finally, I said, "You know what? I need health insurance, right? I'm walking around here one bump away from financial ruin." So I started looking for a new opportunity, and the day I got a new job was the day they told me that they weren't going to be able to extend my contract anymore. And I was like, "Oh, that's funny, because I was going to have the same meeting with you to tell you I was leaving." So everyone was off the hook. But, you know, I worked a couple of jobs before getting to Grubhub.

I think Grubhub was a huge turning point. One, it was the first time I'd worked for a company that actually made a product I cared about, and you never really think about how important that is. So often you're just stuck in this field. If you think about it, technologists are really mercenaries. They don't really care about the field that they're in. They're just like, "What language am I coding in?" The field, or whatever the company does, is secondary to that.

That was the first time I worked at a company that actually made a product that I cared about, and that really fueled my interest in how things truly work, not just from a computer perspective, but from a business perspective as well. That gave me all types of insights that I was blind to before. Because, you know, before that, it was like, I don't care about health insurance, right? You're like, the inner workings of health insurance? I didn't care about legal tax software, which I was responsible for running at a company called Wolters Kluwer. I just didn't care about those things, so I didn't have a vested interest in understanding how they worked. Being at Grubhub and being curious about how things work opened me up to this world of possibility when it comes to operations. So I carried that experience with me into Centro. At first, I didn't care about how ad tech worked.

But I knew that it was important, and could be career-changing, to really understand the business. And I'm glad I did, because that topic is way more fascinating, as much as it is problematic from various perspectives. But it is definitely needed in the internet age, because people don't want to pay for Facebook. It's the trade-off. So how do we do that, and do that in a respectful manner for our customers, clients, and users? So that's a quick recap of 20-plus years of experience, but it's been a fun ride.

Thanks for sharing your story. It's very interesting how you started your journey through this TCP/IP book. I have to agree that, yeah, that book is also way over my head. It's a great book. It's absolutely like a bible. But back then, at that time, it was a really good prop. So over this 20-year journey, I'm sure you have seen a lot of things, right? In the beginning, you also mentioned the messy things of operations administration, and now we're in this era of DevOps. Maybe you can clarify, first of all, for the audience and listeners here: what actually is DevOps?

So the first thing I'm going to describe is what it is not. It is not a role. It is not a job title. That's a major pet peeve of mine. I think it might even have been a little detrimental to my hiring at first, because people are searching for that DevOps role. DevOps is a style of working. It's about creating a collaborative environment between the development team and the operations team, and aligning goals and incentives between those two teams. You know, we say Dev and Ops and focus on those teams, but it really can be any grouping of teams. It could be Dev and Finance, Ops and Finance, Ops and Security. And we keep coming up with all these acronyms, like SecOps and FinOps. It's just like, let's just call it DevOps. We understand what the idea is. I always say you would never post a position for an Agile engineer. That sounds insane. You would never do that, because Agile is a style of working. Same thing with DevOps. How do we build cohesive, joint incentives for teams to work towards a common goal? Putting that front and center shapes a lot of your underlying decisions moving forward, because when you think about it, so many of the problems that we encounter in life, not just in technology, are due to a misalignment of goals.

You said that it's not a role, right? But I agree, in this part of the world I also keep seeing the DevOps engineer role. What do you think the title should be? Is it SRE, as some people actually prefer?

I think systems engineer, system administrator, whatever you want to call it. I think there are enough people that have been in these roles previously that were doing the same things that we're doing today. It's just with a different tool set, right? When we made the switch from Unix to Linux, we didn't really have to come up with a new job description. It was the same deal. It was just, you know, now there's a new tool. When we make the switch from Python to Go, we don't come up with a new job title. It's just an asterisk under the software developer title. I think DevOps is the same. I think we have plenty of titles that could be used. I'm not naive to market forces, though, and I know that if you have a DevOps title, it's probably an extra 10 or 15 thousand dollars a year. So I don't begrudge anyone using that. In fact, I tell the team that I manage: if you want to call yourself a DevOps engineer on LinkedIn, that's fine. I'll back you. But that's not what we're called here, because the minute you give someone a title like that, it becomes their job. The minute you have a QA team, guess whose job quality is, right? Language is a very finicky thing, because it does so much to the way we perceive the world just by giving something a name or title. That's why I try to avoid it, so that everyone knows that this DevOps thing we're talking about is everybody's job.

And we need to think about that from a team's perspective too, right? Because we can quickly fall into a trap where we're like, "Oh, we're building operations stuff. Nobody else needs to see this." Well, that kind of runs counter to this collaborative environment that we're talking about. So maybe we do need to make sure that anyone and everyone can see that. Maybe we need to make sure we're keeping the bodies out of the ops code so that we don't have to lock down the repo. There's always some configuration file where they're like, "We can't share that with anyone because there's too much dirt in there. There are just way too many dead bodies. We've got to keep that under tight lock." By putting that mindset forward, hopefully it helps to prevent and eliminate those sorts of traps.

Thanks for giving that clarification. So in your view, and I think you mention this in the book as well, there are a few things that are very important when you want to adopt this DevOps culture, right? This new style of working. It's an acronym called CAMS, or some people actually call it CALMS.

C-A-M-S or C-A-L-M-S. So maybe you can enlighten us here. What is CAMS?

So CAMS is really just a framework around DevOps as a concept, to think about the core components that you need in order to make this transformation. The C is for culture. The A is for automation. The L that you alluded to, which some people add, is lean, and metrics and sharing are the last two. So culture is one. So much of DevOps is cultural, and you need to build a cultural environment in your organization where these sorts of practices and concepts are embraced, where people are free to experiment without fear of retribution if they get something wrong. There are all these small, insidious things that happen in our culture that we, especially as engineers and technologists, might gloss over, not realizing what an outsized impact they have. Have you ever worked for a company where the culture is, if you don't use your budget by the end of the year, you lose it, and you won't get it next year? That's a cultural thing. It doesn't have to be that way. There is no golden finance rule. That's just the way people operate. So what does that do? That creates a culture where everyone is spending aimlessly at the end of the year to make sure that they use that budget. And I'm sure the people in finance are like, "Sweet, that's exactly what we wanted to happen." But you created that culture through a small rule change. So culture is just something that we always have to keep front and center when we're talking about DevOps.

Automation is really the thing that powers this. One of the things is you're always going to be asked to do more with less. But the other thing is that the more automation you have, the more empowerment you can give to other people in the organization.

In the book, I talk about this idea of exporting expertise. It is a technical dance to fail over a database. I don't know if you've ever had to fail over a production database, but everyone's been at that company where it's, "Oh, we've got to fail the database over. Get Bob. Bob's the only one that can fail the database server over." The minute Bob turns it into a script where someone just has to execute fail_database, Bob has transferred a large chunk of his expertise into this automation script, and now he's empowered tens of people that can do this thing. So automation is key.

I'll talk about lean briefly, but I'm not a huge fan of adding the L. Lean is just about operating dynamically and quickly. I don't know that it falls in the category of DevOps, because typically you're adopting a style that is similar to the organization. So if you're forced to adopt lean but you're in a waterfall shop, that can be problematic.

Metrics: DevOps should be rooted in data. As part of that empowerment, people need to know if the thing that they're doing is actually having an impact. We always talk about it with systems and computers and whatnot, but it really extends everywhere, right? The way we are running our ticket queue, is it actually performing the way we want it to perform, and what metrics are we judging that success or failure by? We should be able to objectively point to success or failure, and we do that through metrics.

And then sharing. Again, back to that expertise exporting: how do we share knowledge? How do we share access when appropriate? How do we share responsibility? How do we make sure that we're all on the hook for the same thing and we're all contributing to make that thing better? That's another thing that's front and center. So these things collapse together to build this framework for how we should think about approaching these DevOps transformations. And when you look at organizations that aren't doing well in their transformation, you can usually find behaviors or actions that tie to one of these five categories. I keep leaving out lean. I'm sorry to all the lean fans. I'm not hating; I'll make it up to you in a little bit.

Thanks for sharing. I really like the concept of exporting expertise. These things about automation and sharing your knowledge, I think they're great, especially in this DevOps culture. Let's go into the anti-patterns that you cover a lot in the book. I think there are probably 10, no, 12 or so anti-patterns, so we can start with the favorite ones. But in the beginning, let's dive into why you wrote this book. Why are you covering anti-patterns?

Sure. So it's a funny story.

The whole book journey was actually a long, winding road. Manning had reached out to me years ago to write a book on Puppet. I was pretty active in the Puppet community at that time, at least from a user perspective. I thought about it and was like, "Yeah, sure." But then, before we got things started, my first kid was born, so it wasn't the ideal time to write a book. Then years later they reached out and were like, "Hey, we're curious if you'd be interested in writing a book on DevOps." And I was like, I don't know what a book on DevOps would be like. But then when I thought about it, I was like, you know what, I want a book that is practical advice for people that aren't in glossy startups that take so much for granted. When you read a lot of these books, they're about companies that are firmly rooted in technology, and the fact that everyone is sort of bought in is a given. That's not the reality that most people face. Most people are in a company that's got a lot of legacy baggage. They've been around for 40 or 50 years. They've got all this inertia around bad practices that we don't really talk about in the DevOps books. We just simply say, stop doing what you're doing and start doing this, and it's like, okay, easier said than done.

The other thing was that a lot of the books seemed like they were written for CTOs, right? Because you would need either to be a CTO or to have CTO buy-in to be able to do a lot of the things that they're talking about. And I was like, well, there's got to be stuff that an individual contributor or line manager can do, because I feel like I've done it a couple of times and I'm not a CTO.

So that was really what got me interested in writing this book. It was originally titled "DevOps for the Rest of Us", but people felt that was an exclusionary us-versus-them thing. I hadn't thought about it from that perspective, and I was like, yeah, it's kind of hard to write a book about bringing everyone together and then immediately set it up as us versus them. The first draft of the book was not in an anti-patterns format, actually. This is my first book. I was very academic and formal, and I got a piece of feedback in our review process. What happens with Manning is you'll do a third of the book and then you release it to a bunch of potential buyers who will review it and give you feedback.

So we were doing that with the first third of the book. Someone made a comment and said, "I've seen Jeff speak, and I was really excited to see this book because he's such a fun, energetic speaker, and that personality does not translate into this book at all." And I was like, wow, okay. All right. My editor was like, "I don't know you well," because I'd never really met my editor, "but it sounds like this book isn't you right now." So that's how we reworked it into this anti-patterns thing, so I could have a little fun with it and get more of my own voice into it. That's how everything took shape, but it was definitely not intended to be an anti-patterns book when it started.

That was the result of a pretty cold but honest piece of feedback that I got.

So I hope when we go through all these anti-patterns, you can also use some of your fun style going forward. Let's start with the first anti-pattern, which is called paternalist syndrome. What is this, actually?

So paternalist syndrome, as I call it, is when someone in a relationship assumes the role of the parent, even though no true hierarchy exists.

And you know who I'm talking about. Ops people, raise your hand. We're all guilty of this. The BOFH, for those that don't know, go ahead and Google it. But there's a reputation of operations teams being the team of no. And I've spent my entire career in ops, so I totally have all of that baggage from years of devs just throwing stuff into production. But I quickly was able to identify that the reason, again, was the fact that we had misaligned incentives. So the paternalist syndrome is this behavior (anyone can do it, but I'm really talking to the ops folks right now) where we assume everyone else is out to destroy the system and we have to protect it. That means nothing can go to production without it coming through us first. That means no one can have access to production. No matter how sane a request it is, you just can't do it, because, you know, we can list a whole bunch of reasons. We'll invoke security as a red herring as to why you can't have access. We'll invoke audit controls as a red herring as to why you can't have access. Sometimes those are true, but we're really starting from the position of no and defending that, instead of starting from the position of "yes, but". Yes, but how do we address the audit?

Yes, but how do we ensure that access doesn't spread? It's really a mindset. So what the paternalist syndrome will do is, anytime there's a problem, the first tool we reach for in the toolbox is another gate, another checkpoint. "Oh, this didn't go through change control." Or, "This went through change control, but ops wasn't on the change control process, so now ops has to approve every change." Every time we do that, we're adding another layer of slowdowns, very little value-add work, and a waste of time for ops folks too. There was a book that I read, "Rework" by Jason Fried from 37signals, and it has one of my favorite quotes. I use it all the time: "Policies are organizational scar tissue." Every time a policy is enacted, you can typically go back to some inciting incident where that happened, and the policy came out as a result of that, almost to the point where it becomes lore and legend, where no one who was there for it is even around anymore. Right? It's like, "Oh, yeah, there's some dude on life support in the lunchroom. We keep him around just so he can explain to us why we don't do stored procedures anymore." There's always some story like that.

So the paternalist syndrome chapter talks about how we can identify some of the common issues that we run into, and the pattern of fear, honestly, because that's what it is: fear around particular situations, and how we can go about breaking that down using CAMS. How do we do that through automation, so that we can empower someone? A perfect example is: we don't let developers restart background services, right? So why not?

They know way better than I do how the system is going to behave, because they wrote it. So why wouldn't they be able to reset it if they're seeing something weird? Maybe we don't want to give them SSH access to production. That's fair. But there are a lot of different ways that we can restart a service, and we can expose those ways to the developer and empower them to be able to do it. "Oh, but then we won't know." Okay, your program can send an email, right? That's not crazy.
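As an illustration of the idea Jeffery describes here (a restart path that developers can use without SSH access, and that tells operations after the fact instead of asking permission first), a minimal Python sketch might look like the following. Everything named in it is an assumption made for illustration, not something from the episode: the `background-worker` service name, the allow-list, both email addresses, and the use of `systemctl` as the restart mechanism.

```python
import subprocess
from email.message import EmailMessage
from typing import Callable, List

# Hypothetical allow-list: the only services this path may restart.
ALLOWED_SERVICES = {"background-worker"}


def build_notice(service: str, user: str) -> EmailMessage:
    """Compose the heads-up email; both addresses are placeholders."""
    msg = EmailMessage()
    msg["From"] = "restart-bot@example.com"
    msg["To"] = "ops@example.com"
    msg["Subject"] = f"{user} restarted {service}"
    msg.set_content(f"{user} just restarted {service} through the self-service script.")
    return msg


def run_systemctl(command: List[str]) -> None:
    """Default runner: execute the command and raise if it fails."""
    subprocess.run(command, check=True)


def restart_and_notify(
    service: str,
    user: str,
    send: Callable[[EmailMessage], None],
    run: Callable[[List[str]], None] = run_systemctl,
) -> None:
    """Restart an allow-listed service, then notify ops. There is no approval step."""
    if service not in ALLOWED_SERVICES:
        raise ValueError(f"{service} is not restartable through this script")
    run(["systemctl", "restart", service])
    send(build_notice(service, user))
```

The `send` and `run` callables are passed in so the sketch can be exercised without touching a real host or mail server. In a real setup, `send` could wrap something like the standard library's `smtplib.SMTP(...).send_message`, and the script itself would sit behind whatever access point the team chooses to expose.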

So if you really want to know, if you're not approving and you just want to be notified, have your script send an email saying, "Hey, so-and-so just restarted the service through this access point." Boom, done. The other thing with the paternalist syndrome is that we often insert ourselves as gatekeepers when we're adding zero value to the process. So I'll give you an example.

A real-world example from Centro: when I started at Centro, there were a lot of requests for ad hoc script execution from development. "Hey, I need this Ruby script run." So we would have to get the script, copy it out to one of the boxes, SSH into the box, and run the script in no-commit mode. We'd share the output with the developer, and he or she would look at it and say, "Okay, it looks good. Run it in commit mode." We'd run it in commit mode, and then I'd send them the output.

What value do we add to this process, other than being a middleman? What value are we adding? And the truth is zero, because even if we wanted to be part of the approval process, I don't know anything about the script. I didn't write any of this code. "Okay, you're updating flights to make sure the campaign IDs match." I don't know what that means. You just said a bunch of words that I'm going to take at face value that this is important.

So we said, what if we were to set this up via Jira? Maybe we could create a change process where someone attaches a script to the Jira ticket. The Jira ticket has to be approved by another dev, who is way more qualified to approve it than I am. Once it's approved, we have some automation grab the script from the Jira ticket, execute it on the box, and then attach the output to the ticket.
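The flow described above (attach a script to a ticket, have another developer approve it, let automation run it and attach the output) can be sketched in a few lines of Python. This is only a model of the process under stated assumptions: the `Ticket` class is a hypothetical stand-in and not the Jira API, and `execute` represents whatever mechanism actually runs the script on the target box.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Ticket:
    """Hypothetical stand-in for a change ticket; this is not the Jira API."""
    key: str
    author: str
    script: str
    approved_by: Optional[str] = None
    attachments: List[str] = field(default_factory=list)


def run_if_approved(ticket: Ticket, execute: Callable[[str], str]) -> bool:
    """Execute the attached script only after someone other than the author approved it."""
    if ticket.approved_by is None or ticket.approved_by == ticket.author:
        return False  # not approved yet, or self-approved: do nothing
    output = execute(ticket.script)    # run the script on the target box
    ticket.attachments.append(output)  # attach the output to the ticket
    return True
```

The one design choice worth noting is that the approval check requires a second person, which mirrors the "approved by another dev" step in the story. Operations does not appear in the loop at all; it only owns the automation.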

Now you don't even need us. Now there's not a dev sitting around waiting, like, "My goodness, is ops back from lunch yet? I really need to run this script, but no one's there." They can just do that on their own. They self-police it, they self-govern it. Now, nothing is without trade-offs. So now you have the issue of, "Oh, we don't have to fix this problem, because any time it comes up we'll just run the script." And because there's no friction anymore, a permanent fix is probably not as attractive to them, whereas before they had the pain of going through ops to entice them to do that. But still, all in all, it's a boon. It's a win. With some standardized scripts, we were even able to push that functionality down to customer service. So customer service is like, "Oh, when you see this problem, you have to submit this Jira ticket, and then once it's approved, you'll be able to execute this script to clean it up."

So the paternalist syndrome is really about changing your mindset, getting out of the habit of just being "no" and shifting to a "yes, but". Yes, but how do we solve for these problems? Because no one wants to feel like they don't take their job seriously. Everyone at work is doing their best with the tools that they have to do their job, and this idea that there's a group that instantly assumes you're an idiot is not a warm, fuzzy feeling for anyone.

So, yeah, as you mentioned in all these stories, operations people assume the role of a parent. It's what they always say: the ops job is to make the system safe, make the system secure, stable, whatever that is, while the other parts of the company or the other parts of the team, like dev, are out there to introduce changes and instability. Like what you mentioned, it is like they are there to destroy the company. So I think all these anti-patterns resonate a lot with the traditional world.

Yeah, there's something else. When you said "make the system safe", you know, that's an interesting way to phrase it, because I completely agree. Think about the system that we're running this on as a huge toolbox. Basically, what we're saying is the only tools we have are a bunch of sharp knives.

We don't want to give you access to it because we're afraid you're going to cut yourself. And it's like, well, okay, maybe we could throw some different tools in there, right? Maybe we could throw a hammer in, maybe we could throw a spoon in, something that's a little safer that still allows me to do the job. But yes, if you're asking me to hammer a nail with a machete, that's probably unsafe, and that's essentially what we're doing. Like, "Well, because we've only got machetes, I can't give anyone access."

So, yeah, for all the listeners who listen to this, I hope you notice these patterns, or anti-patterns, in your team. Make sure that you don't assume this parental role unnecessarily. So let's move on to maybe the next anti-pattern, which is quite common for any administrators or operations people, which is alert fatigue. Many times there are so many alerts popping up, and probably we don't take action on most of them.

Can you explain a little bit more about alert fatigue?

So alert fatigue is actually a term borrowed from the medical industry. It came about from nurses who would not respond to beeping alarms in hospitals, because the alarms always go off. They always go off. So they became desensitized to it, to the point where even when there was an emergency, there was no way to elevate that emergency beep beyond the cacophony of sounds that was always going off from these machines.

So we borrow that term, alert fatigue, in technology to ask: what alerts do we have that are just constantly firing, that are drowning out more critical, useful, actionable alerts? I think the key word there is actionable. When we design alarms, we design them from the perspective of things we think might be bad. We say, "Oh, we'll alert on high CPU utilization because that sounds bad." But it isn't really bad. We buy these machines to use them.

So this idea that we're worried that a machine is suddenly forty or fifty percent utilized, that's not really a bad thing. Let's say it is 90% utilized. Do we care? What are the other factors? CPU utilization on its own is not a reason for concern, at least not to wake someone up. I think we've all gotten those alerts where it's like, "Database utilization is high," and it's 3:00 a.m. We're running all of our batch processing. That makes sense to me.

But why am I being woken up? And then the other thing to think about is, if you can't design an alert that leads an engineer to take a next step or action, you need to seriously question the value of that alert. So when we have an alert that says replication lag time is high, you should be able to say in that alert what to do. This is an actual alert that we have. The alert will say: "Replication slot 1 has exceeded replication lag time. Chances are this is related to the database replication service being run by the BI team. You should investigate that database and see if that's the cause of the replication lag. If it is, restart the Debezium connector in order to catch it up."

That's a very specific set of actions, where I don't have to do a bunch of things. If that's not the case, if that's not what's going on, it's like, well, clearly when we wrote this alert we had a very specific set of scenarios that we were worried about, and this is outside the boundary of that. Maybe I should look deeper into this. So this idea of alert fatigue is about making sure that these alerts are actionable, and if they're not actionable, get rid of them. Just get rid of them, because they're not helpful.
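One way to hold alerts to the "actionable" bar is to make the next step a required part of the alert definition, so an alert with nothing to do cannot be created at all. The sketch below assumes a hypothetical `Alert` class with illustrative field names; it is not taken from any particular monitoring tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    """Hypothetical alert definition that must carry a next step."""
    name: str
    summary: str
    likely_cause: str
    next_step: str

    def __post_init__(self) -> None:
        # An alert nobody can act on is noise: refuse to define it.
        if not self.next_step.strip():
            raise ValueError(f"alert {self.name!r} has no next step")

    def render(self) -> str:
        """Text the on-call engineer sees when the alert fires."""
        return (
            f"{self.summary}\n"
            f"Likely cause: {self.likely_cause}\n"
            f"Next step: {self.next_step}"
        )


replication_lag = Alert(
    name="replication_lag",
    summary="Replication slot 1 has exceeded replication lag time.",
    likely_cause="the database replication service",
    next_step="Check that database; if it is the cause, restart the connector.",
)
```

The wording of `replication_lag` paraphrases the alert quoted above; the point is only that the cause and the action travel with the alert.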

Now, I'm assuming you've been on call before, and, not casting any blame on anyone, I would imagine that you've probably received an alert where you said, "Oh, this alert usually clears itself. Let me snooze it for 15 minutes." Everyone's done it. Everyone's done it. So the question is, why not just increase the threshold of the alert? And people are like, "Oh, well, then if there's something wrong, I won't know for an extra 15 minutes."

Well, you don't know anyway, because your first action is always to snooze it for 15 minutes. You don't know anyway, because the very first thing you do is say, "Oh man, that stupid memory thing. We always know that when the code gets into this particular section, memory gets high and it clears itself." So you're not really doing yourself any service by having it. So push it out 15 minutes. When it alerts, you need to react quickly because you're already behind the eight ball. But guess what? You're reacting, and you know it's real.

I would rather be a few minutes late to an alert but know it's real than be constantly alerted, not sure if it's accurate or not, and having to figure that out and decide. Because I know that as human beings we're going to err on the side of the pattern and just say, "I'm going to snooze this." If you're at a barbecue, you're out with your friends having a cookout, you're eating, the sun's out, it's a beautiful day, your alarm goes off, and it's like, "Oh, it's that memory alert again." You're never, ever, ever going to be like, "Guys, I gotta go. This thing that alerts every 20 days or whatever is alerting again, and I've got to look at it." No, you're going to snooze it. You're going to continue chatting with your friends. And then you crap your pants when you realize it's a real alert and you've got to do something.
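The "push it out 15 minutes" idea amounts to alerting only when a condition has held continuously for some window, so a spike that clears itself never pages anyone. A minimal sketch, assuming samples arrive as time-ordered `(unix_seconds, value)` pairs; the function name and parameters are illustrative, not from any specific tool.

```python
from typing import Optional, Sequence, Tuple

Sample = Tuple[float, float]  # (unix_seconds, value), oldest first


def sustained_breach(
    samples: Sequence[Sample], threshold: float, hold_seconds: float
) -> bool:
    """True only if the value has stayed above threshold for hold_seconds."""
    if not samples:
        return False
    breach_start: Optional[float] = None
    for ts, value in samples:
        if value > threshold:
            if breach_start is None:
                breach_start = ts  # the current breach began here
        else:
            breach_start = None  # it recovered on its own; reset the clock
    latest_ts = samples[-1][0]
    return breach_start is not None and latest_ts - breach_start >= hold_seconds
```

With a 900-second hold, the memory spike that always clears itself never fires, while one that stays high for the full 15 minutes does.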

So alert fatigue is really focused on identifying those patterns and trying to make them better, or just simply eliminating them. Another big thing that the chapter talks about as well is creating metrics that reflect a business impact. Going back to the CPU utilization example: if the database is at 90% CPU utilization, but our transactions per second is steady and isn't climbing, do I care?

I don't care. Work it. Yeah, sure, 90% utilization. Now, that's not to say that you don't want the metric, right? You just don't want the alert, because the metric is good for trending, capacity planning, all of these great things. I'm just saying I don't need to know about a capacity planning alert at 3:30 in the morning. That can be an email. And that's another thing too, right? We always default to waking someone up as the default alert.
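The distinction between keeping a metric, sending an email, and paging someone can be written down as a small routing rule. The sketch below is an assumption-laden illustration: it treats a drop in transactions per second against a baseline as the sign of business impact, and the 90% limit, the 20% tolerance, and all names are made up for the example.

```python
from enum import Enum


class Route(Enum):
    PAGE = "page"                # wake someone up
    EMAIL = "email"              # read it in the morning
    METRIC_ONLY = "metric_only"  # record it for trending, notify nobody


def route_cpu_alert(
    cpu_pct: float,
    tps_now: float,
    tps_baseline: float,
    cpu_limit: float = 90.0,           # hypothetical threshold
    tps_drop_tolerance: float = 0.2,   # hypothetical: 20% below baseline
) -> Route:
    """Page only when high CPU coincides with a drop in business throughput."""
    if cpu_pct < cpu_limit:
        return Route.METRIC_ONLY
    business_impacted = tps_now < tps_baseline * (1 - tps_drop_tolerance)
    return Route.PAGE if business_impacted else Route.EMAIL
```

A busy database with steady throughput produces an email at most; only high CPU together with falling throughput pages the on-call engineer.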

There can be different types of alerts. Have a low-priority alert that emails you, so you wake up in the morning and you say, "Oh wow, we were at high CPU utilization last night. Nothing else was impacted, but that's a good data point to know." And I'm much more receptive to that data point in the morning at 9:00 a.m., when I've got my coffee, as opposed to three in the morning, when I don't know what I'm really looking at or trying to solve.

So get rid of alerts that don't mean anything to you. Tweak your alert notification settings so that you can do emails instead of always paging out. Try to tie your alerts to some sort of business impact, so that you know whether you really need to wake someone up or not, or if the alert that's firing is something that you actually care about. Because again, I like my databases busy, as long as they are within their operating thresholds.

So you mentioned something that piqued my interest, which is that it's okay to actually be alerted late, instead of always within the first minute, right? You get alerts popping up here and there, because some alerts do actually recover by themselves. Because of this anti-pattern, for sure, people just put in alerts, but actually they could recover over a short period of time. But what you're saying here is that it's okay to actually be alerted late, as long as you can guarantee that it's actually a real problem and you are supposed to take action on it. So I think the suggestion, for everyone here who has been in operations or is still working in operations, is that you should probably tweak your alerts so that they behave much more properly.

Yeah, absolutely. Absolutely, because very seldom do those extra minutes actually mean anything, right? And it's one of those counterfactuals that we talk about in these incident reviews, where it's like, "Oh man, if we had known about that alert five minutes earlier, we could have prevented the outage." Probably not. You have no idea how quickly five minutes goes by. You're looking at something, you're tracking down a red herring. You're like, "Oh yeah, it's probably this thing over here," when it's something completely unrelated.

A lot of times those five minutes aren't buying you as much as you think they are. Of course, your mileage may vary; take that with a grain of salt. If you're in high-frequency trading, maybe it's a different ball game, but for most of us, for my target audience, it's like, yeah, you'll be all right.

So let's move on to the next anti-pattern, which is about wasting a perfectly good incident. This is interesting, "a perfectly good incident." Maybe you can explain about this?

Yeah, so it actually stems from a saying from a local politician here in Chicago, who would always say, "Never let a good crisis go to waste." It's this idea that there is so much to learn from an incident. Because when you think about it, we have these mental models of our systems, right? These systems are becoming so complicated, so complex, with so many different pieces. So everyone has in their head this mental model of how they think the system works. Then there's the reality of how it actually works.

And those two are seldom in full alignment, because usually something's off. And the time you find out about that delta, that your model is different from reality, is when there's an incident. Other than that, you can spend the rest of your life in complete delusion, thinking you understand how the system works. But then when there's an incident, suddenly the gap between your understanding and reality is exposed in its raw form.

So often we sort of just close the incident ticket and move on, but it's like, hang on. Let's dig into this incident and try to make our mental models better and learn from our mistakes. It goes beyond just the tech side. It's also the human side. So simple things that you uncover, like, "Okay, all right, Henry, I noticed that this alert fired. You got it and snoozed it, and then it re-alerted 15 minutes later, and then you engaged. What happened there?"

"Oh, well, this system alerts all the time and it typically auto-recovers. So when I got the alert, I snoozed it, thinking it was going to recover. But then when it didn't recover, I realized, okay, something's really wrong." So as a manager, me personally, right, if I'm not part of the on-call rotation, I may not know that that reality, that dynamic, exists. So just simply asking that question in the incident review reveals something to me.

Like, whoa. Okay, we've got alerts that are so bad people ignore them because they're so common. And I'm sure everyone on your team is going to be backing you up: "Yeah, yeah, I know that alert. I hate that alert." So now it's like, okay, clearly I have poor alerting, and that poor alerting is impacting my on-call team. Because every time you're waking someone up, any time you page someone, you're interrupting their life. They could be at dinner.

They could be at a movie. They could be taking care of their sick mother. You have to think, when we page out, what is this person doing in their life that I'm interrupting, and is this worth it? As a manager, if I'm not part of the on-call rotation, that is information that I can get from the incident review process. Okay, cool.

Here's another example, a real-world example, where two engineers were talking about the same system using different terminology and didn't realize that they were talking about the same system, because one team uses term A and another team uses term B. So my Ops guy was thinking that there's some new system that he doesn't know anything about that he's tracking down, and he's pissed because there's no monitoring, there are no metrics around it. But lo and behold:

"Oh no, we have a terminology difference. What you're calling Sidekiq, we're calling consumer daemon, and they're the exact same thing." That's a huge disconnect. But now that he understands we're talking about consumer daemon, he instantly has a different view of the entire scenario, because he's like, "Oh, now that I know we're talking about consumer daemon, I understand."

These aren't technical problems. These are human, meatspace problems that are coming out of the incident review process. But once you start to dig in and peel back and get into people's headspace, it's like, oh, okay, all right, this is making sense. "So I noticed, Henry, after that was all over, you restarted the service. What made you think to restart the service? What information led you to that?" "Well, honestly, I was out of ideas, and I noticed the memory utilization was high."

"So what made you look at the memory utilization?" "Well, I normally don't look at that, but I happened to be looking at a different screen and saw that it was high. So I just said, well, why not restart the service?" "Okay, but when you restarted the service, you didn't realize billing was running, and that interrupted the billing process. So now we're not getting bills out on time, which was an ancillary effect." "Oh, I didn't realize the service communicated with the billing process."

"Oh yeah, it's part of the billing process, because they share a queue or something like that." "Oh, well, it's not like I intended to impact billing. I just had no idea." All of these sorts of conversations happen because, again, everyone has a slightly different mental model of the system. So wasting a perfectly good incident is this idea that you just say, "Oh yeah, system was low on memory, Henry restarted the service, service recovered."

It didn't just recover. You impacted the billing team, who now has to rerun the billing job. The engineer you were communicating with didn't realize you guys were using different terminology. Had he known that, he might have made different choices, if he knew that we were talking about this particular service. We've discovered that, oh, we have poor alerting, and because of that people are hesitant to do things right away. They're waiting until it really hurts.

There's all of this information that could have been easily dismissed and wrapped up as, "Memory utilization was high, restarted service, close ticket." So wasting a perfectly good incident is really this idea that there's so much more information in a failure that we can bring out, if we just really want to put some energy towards it.

Thank you for explaining all these different scenarios based on anecdotes and all that. I realize this is what some people call a postmortem activity. But one thing that I do find a challenge sometimes is that those people who are involved in the crisis, after they solve the problem, of course, they're like, "Okay, I don't want to deal with it anymore." That culture of assessing the incident, assessing what we can learn from it, I think it's not there for most people, I would say. So how do you actually inculcate this culture, so that it becomes a thing, a common thing that people actually want to do because they get value out of it?

So the first thing that you can do as an individual contributor (I'm assuming you have the willpower, so I'm going to say an individual contributor in this scenario), the first thing you need to do is do it fast, like immediately after the incident, within 24 hours. Why? Because things have a sense of permanence in our minds, but only for a short period of time.

So it's like something permanent, but only for a short period of time. That's how it works. Within that 24-hour period, this is the most important thing that has ever happened. But after that period, it becomes just noise in the background while you've got all of these other demands on you. So if you can get people in the room as soon as possible to talk about it, the actual incident is fresher in their minds, so they can recall more vivid details, and they're much more energized around it.

You're really manipulating their psyche, honestly, by doing it early, to say, "Hey, let's do it now while you're super interested." It's why a car dealer never wants to let you leave the dealership. He doesn't want you to go home and think about it. He wants you to remember what that SUV felt like in the moment, while you're in the store, so he can sell it to you. It's a basic human emotion.

So the biggest piece of advice I can give is, like I said, schedule it as soon as possible, definitely within 24 hours. And once you do a few of these, I guarantee you, even without any real training in the process, you're going to discover things, and as you discover them, people will instantly see the value in it. The hard part is the follow-up action items. Because how do you influence people whose schedule and prioritization you have no control over, to make sure that some of these things that were brought up are addressed?

Sometimes it's worth doing it just to have the knowledge, just to be able to say, "Well, we know these things now, and even though we're not going to correct them, in the next incident we're aware of them." But for the things that you need to actually fix, that's where you really need some leadership, and to be able to say, "Hey, these are things we discovered in the incident postmortem process. We're really going to need help and commitment from your team to address some of these."

Putting dollar signs next to the actual potential risk helps sometimes: "Yeah, we lost seventy-five thousand dollars for a 15-minute outage." Sometimes that can be persuasive. Not all the time, because again, it's not coming out of anyone's paycheck.

So by design, the only people that really care about that are management. But yeah: 24 hours, proving the use of it, and documenting that use. Spreading that information is part of the sharing portion. Have it somewhere where people can see it, review it, and understand it. The other thing that really makes it shine is when someone is looking at an incident and they see this beautiful postmortem incident review, and then they go to another incident and it's just like, "Restarted service. We're good."

It's like, whoa, whoa, where's this? That sort of creates this pattern of, "Hey Henry, I noticed that for the last incident you ran, you didn't do a postmortem like the other members of the team. You should be doing one too." And it just becomes a cultural thing. It just becomes this thing that everyone rallies around.

But it takes time. So, you know, lead by example. Do it yourself, and when you do it, don't treat it like it's some newfangled thing. Do it as if it's the most logical thing that you should be doing. "Yes, of course we're going to do a postmortem. Why wouldn't we? We just had an incident." Hide the fact that this is the first postmortem you've ever done in your life.

Just treat it like it's the most natural next step. "Yeah, we're going to do a postmortem. We've got to understand what happened. Don't you think we need to understand what happened?" Who's going to say no to that? "Yeah, I guess we should probably have a better understanding." "Yeah, so that's why we're going to do the postmortem. That's it. Come on."

So as you mentioned at the beginning, right, probably all of this is also part of culture. So you can't just switch everybody in a second, like once an incident happens, everybody will just do postmortems for the next few incidents. So sometimes I think we need the buy-in and also, I would say, probably policing. What you said is like leading by example, right? So the people at the top probably also need to spend the time, or actually allocate the time, to do this stuff, because it's important and high value for the company, not just for that particular team.

And like with any cultural change, you've got three categories of people, right? You've got supporters, detractors, and fence-sitters. The majority of people are fence-sitters, the vast majority of people, and this is true of any context. You look at politics: most people aren't on the extreme ends of either side. Most people are in the middle, but the extreme ends are the loud ones, so those are the ones that we focus on. It doesn't take a lot of people to change the culture of an organization.

It takes a few supporters, or even a few detractors; it works both ways. So with positive influence, it only takes a handful of people to have the fence-sitters move over. Same thing with the negative, though. With the negative behaviors, it only takes a handful of people to be like, "Well, this is how we're going to do things," and then suddenly your culture's in the tank. So think about that as you're recruiting people for these different aspects, right? Who are the people that can really be boosters for me? I don't have to convert everyone. I've got to convert a few really boisterous cheerleaders for this, and if I convert them, people are going to follow. And once people start following, you have overwhelmed the detractors and they lose.

So as I hear your insights about these anti-patterns, the audience here can actually find more anti-patterns in the book. There are some other topics that I think are really relevant, but because of the time, I'm sure we cannot cover all of them. So for people who are interested, go buy the book, or read the book, and you can learn all the fun stories and anti-patterns from Jeffery.

Yeah, and you should buy the book.

I don't want to go over all 12.

So Jeffery, before I let you go, normally I have this one last question that I always ask all the guests, which is called the three technical leadership wisdom. So this is just for people to maybe learn from your journey. What kind of wisdom do you have from your career that you'd want to share with everyone?

Okay. The first one is, of course, never let perfect be the enemy of good, or good enough. A solution that is 70% effective is better than a perfect solution that hasn't been implemented. So never let the fact that it's not perfect stop you, because guess what? It's never perfect. It's only perfect in your mind. Keep pushing, get it done. You're going to learn a bunch of stuff. It's not going to be perfect. Something's going to be screwed up.

The second piece of advice, along the same theme of this whole perfect-is-the-enemy-of-good discussion, is that there is no point in a project's journey where you know less about the requirements than at the beginning. When you start, that is the most minimal amount of information you're ever going to have about your requirements. So keep that in mind when you're designing a solution for something, because as you move through the project, your requirements are going to become more and more concrete and understood.

Product people and project people are going to come to you with this whole list of requirements, and it's going to look like they did a bunch of due diligence, but know that they only know so much at this point, and cut them some slack as a result of that. So we have a problem now at work where Ops gets involved too late in the life cycle of a project. But a lot of it is because when they start, they don't know if they're going to need Ops support or not. Because it's like, "Well, if it's a feature in the monolith, we don't need Ops for anything. We can develop that on our own. They've given us all the automation tools we need. We can do everything we need to do without them."

But then if halfway through the project they pivot and say, "Oh, this needs to be a separate microservice," well, suddenly that's a whole new ball game for Ops. But we have to accept the fact that they did not know that when they started. They made the best choice that they could make. So always keep that in mind.

I guess the third thing is, as an engineer, you have an implicit bias in just about everything you do, and it's particularly bad in technology. There is a strong sense of, "I didn't write it, therefore it's crap." A good engineer knows the difference between a preference and a problem. It's a pattern that I see all the time. An engineer comes in, they get hired and they're ready to get started. They look at some code or something and they're like, "This is all wrong. I can't do anything."

It's funny, because this thing's making us like 140 million a year. So tell me what is so broken about it. Is it perfect? No, of course it's not perfect. There's a bunch of problems with it. See my previous two ideas and suggestions. You have to understand what's a preference and what's a problem, so that you know where to focus and put your energy, because no one makes a career out of rewriting something that was already working and just introducing different problems.

Because you never actually fix it, right? You might shift it, you might change the sort of problems, but there's always some new problem that you're going to be dealing with. So accept that, and know that everything you do is going to be some future engineer's thing that they hate. You're going to make a choice, and someone's going to come in five years later and say, "Who was this dummy? Why didn't you write it in Go? Why didn't you write it in [insert new language that's hip and trendy]?"

And I'm going to add a fourth one too: when you make a choice, document your constraints and your context. When you make any sort of technical decision, why did you make that decision? What was the reality on the ground? I remember a story where a guy was telling me about all of this handcrafted code that was built at his company, and he's like, "I don't understand why they didn't just use Kubernetes to orchestrate all this."

I was like, "When did they write this stuff?" He's like, "Oh, it was probably eight or ten years ago." "That might be why they didn't use Kubernetes, right?" He's like, "Yeah, I hadn't thought about that." So you've got 10 years of energy built into this thing. Just switching to Kubernetes like that isn't an easy thing. So there's always context around every technical decision that gets made. If you can document that, it'll save you some hassle and save future engineers some energy around understanding why particular decisions were made.

Thanks for sharing this wisdom. I'm laughing as you said all this, because I can see these patterns over and over again in every place that I've been. So this is just very common. So thanks, Jeffery, for your time. For people who want to learn more about you, connect with you, or find the book, where can they find you?

Sure. Yeah, so I have a website that I don't really update or maintain, but if you feel like it you can check that out at attainabledevops.com. Most likely the best place to find me is Twitter, where I'm @DarkAndNerdy. You can find the book, Operations Anti-Patterns, DevOps Solutions, at Manning.com if you want to order direct from them, both physical and ebook copies, but it's also available on the Amazon bookstore and on Audible. I don't read it, though, unfortunately.

Everyone's like, "I bought this thinking you were going to read it." I didn't even know they were turning it into an audiobook. I got the email the same time you guys did. So I'm going to lobby for the second edition that I read it. We'll see how that goes.

Yeah, I mean, when you read it, probably you can bring this fun style, so for people who are listening it can also be entertaining at the same time, which I think you're doing as you speak just now. So thanks again, Jeffery, for your time. I really learned a lot from this conversation, and I wish you good luck with the things that you do.

All right. Thanks for having me. I had a really good time.

Thank you for listening to this episode and for staying right till the end.

If you highly enjoyed it, please share it with your friends and colleagues who you think would also benefit from listening to this episode. And if you're new to the podcast, make sure to subscribe and leave me your valuable review and feedback. It really, really helps me a lot to grow this podcast. You can also find the full show notes of this conversation on the episode page on the techleadjournal.dev website, including the full transcript, interesting quotes, and links to the resources and mentions from the conversation.

And lastly, make sure to subscribe to the show's mailing list on techleadjournal.dev to get notified of any future episodes. Stay tuned for the next Tech Lead Journal episode. And until then, goodbye.

Transcript source: Provided by creator in RSS feed: download file