What's up, Warren? How you doing?
I'm doing pretty well, thanks.
Well, cool. I cut my intro a little shorter this week because I just kind of assume, and maybe this is a bad assumption on my part, that people know which podcast they're listening to because they had to click on it, and so I felt like it was kind of redundant to say, welcome to the Adventures in DevOps podcast.
You just did it, right?
That was subtle though, right, that was clever. I'm gonna just pat myself on the back for that one.
Yeah, I've actually got an interesting fact that I can share, so, you know, we can jump into that. There was an OTP provider that actually changed hands, similar to the XZ vulnerability in compression on Linux not too long ago. And for those of you that don't know OTPs, that's one-time passwords, so you can think of like an auth app that you've got installed on your phone.
And it's sort of ridiculous that this even happened, because if you think about how bad it is for an open source library to get co-opted by a malicious attacker, imagine an application on your phone that is also responsible for security, for two-factor auth codes, changing hands. That provider now has access to every single one of those users' two-factor auth, and it could even be the primary factor when it comes to password resets and whatnot.
So that's a great way of stealing credentials. And I don't think it's an attack vector that a lot of people think about. I think they, you know, it's like, whatever, it just stores my two-factor codes, doesn't really matter.
But now there's actually a huge problem that could come up because of that.
Nice, I look forward to it. I'm just excited about that.
I think everyone really has to switch over to WebAuthn, and that's the true secret. And if you don't know what that is, come and talk to me after the show. I'm happy to give everyone an earful about that.
Or can we just give up on it and everyone just uses "password", all lowercase letters, for their password? I mean, there's some credibility to that approach.
Right, there's a whole episode there.
Maybe, maybe. But in today's episode, we're talking about one of my favorite topics, incident response and on-call management. And to chat through that topic with us, we've got Falit Jain, the CEO of Pagerly, joining us today. Falit, welcome.
Thanks.
Thanks for having me on, guys. Happy to join on this and talk about incident response and on-call. Right on.
Cool.
I feel like incident response is a learned skill, you know, and it's learned on the job, under pressure, when everything's going to hell. And prior to your first incident, you never even thought that this is where your life was going to lead. So how did you end up starting a company dedicated to incident response?
It began when I started out in my first job, basically. I was part of Amazon, on the detail page team, which is frankly one of the highest pages in terms of traffic, if you consider it. And that was the first-hand experience. My manager put me on call within months of joining, and he told me, hey, this is the best way to learn about things. And I kid you not, that was some serious pressure in the initial stretch. And since then I've always known, yeah, this is the best way to learn things, because you are absolutely learning each and every bit of things in the shortest possible amount of time.
So I think that was my first, I would say, interaction with incidents and the on-call world.
Well, let's be realistic there. Your manager's thought process was actually, if this guy's going to quit, I'm going to make him quit sooner rather than later, so I'm putting him on call right at the beginning.
It could be. I think it was probably, you know, just testing. This is how you test the waters. And yeah, it was a pretty new world. Before that, I always thought, yeah, software development is more about building stuff and, you know, maybe designing it. This part is truly about managing or maintaining your product.
And that was the first kind of interaction I had.
Right, cool. So then, after Amazon, you went to Disney, right?
Yeah, so in Amazon it was pretty interesting.
On the detail page there were quite a few such days. Especially Prime Days and Cyber Mondays and the Christmas week are always pretty high-pressure stuff, and I remember each and every one of them in terms of the events. Even if there's a small blip, there are so many teams on a single bridge, even in different locations, and everyone is just giving their status updates every so often, and it was almost like, I'd say, a war room.
People, I think, have now started to tag on-call rooms as war rooms. Everyone gives their status updates, and it goes on for quite a few nights. So that was pretty interesting.
In Disney, it was the other way around. Our major events in Disney were for the live stream. In India we had cricket as the major sport, and we used to live stream cricket, and we had around twenty million concurrent viewers at some point.
And with that scale, each and every bit of the system, from the CDN to a load balancer, even a small Kubernetes pod, even our caching systems, everything gets tested a lot. So for us in Disney, that was our major priority: how to manage our on-call response during the live streams, because we cannot afford to go down even for a minute, because we know how high the stakes of the live events are for the company.
Well, especially when you're talking about like live streaming cricket to Indians, because y'all take that seriously.
Yeah, yeah, yeah. You know, the broadcast contracts are for billions of dollars, and let's say we go down for a couple of minutes. We're really losing money at every point in time, and we can see it on Twitter. We're just trending there in real time, like, your app is down, what's happening and what's not. So you need to be very, very careful how you respond publicly, and how to quickly bring things up in a way that lasts at least through the live event. So those were, I'd say, the closest I could be to a customer at that point of time, and the most an engineer can be at the highest pressure point. So yeah, that was my Disney stint, and after that I started out with Pagerly.
Right. Yeah, so I think after that you're kind of committed at this point. Your career path is now incident response after those two stints.
Yeah yeah, yeah.
So interestingly, I think the two companies had a slightly different way of handling response, maybe because of the company size or team sizes.
But overall, I think the concept was that everyone was leading toward a single sort of goal, where we wanted to avoid the same incidents happening again, and that's our primary goal.
And what I saw in the two companies is that the primary part of incident response is the process. How do you set up the processes, and how do you enable your engineers to follow these processes?
And as an engineer, as a developer, that's why we call on-call an operational part. Nobody wants to spend a lot of time on it. Everyone wants to maybe code, develop features, maybe even design, architect, even blog nowadays, but on-call is the last part anyone wants to spend time on. It's mostly grunt work, especially if it's after-hours work related to bugs or incidents.
So that's where I saw a lot of common patterns across incident response and even on-call management.
There can be a lot of tools or automations or even assistant agents which can help the engineers.
So that's why I started with Pagerly, which is helping teams with incident management.
Yeah, I think that's a solid point that's often overlooked. I work a lot with early stage startups, and it's a pattern I've seen over my career: the biggest part of incident response happens before you ever have your first incident, because you have to talk through, like, what are we going to do when this actually happens, who are we gonna bring on, how are we going to carry out communications? And so, yeah, I think that's a good solid point. I like the fact that you mentioned that there are multiple ways of doing that. You know, there's not one right way.
Right, right. So I would say the first part is you need to realize, yeah, the time has come in my organization that we need to have this set up. What I've usually seen is that ninety percent of the time it comes down from the top leadership, if you have CTOs or VPs, depending on the organization size. If those folks have come from a place where incident response or on-call processes had been in place, they bring that culture into the company, because they have realized the value of these processes over time. In companies where they have not, they usually take a longer time to realize, yeah, we need such processes. So the first part is to realize, hey, these are the processes we need so that we can at least reduce our issues over a longer frame of time.
So that's the first part. The second part is then to set up that on-call roster. Now, that is something which is very dependent on how the org works. Some organizations want to have a centralized on-call team which kind of handles everything. Let's say even if it's an SRE issue, they do it; even if it's an infra situation, they handle it. And people have different ways of setting it up, maybe one person from each team, or just one person every week. Some organizations have a dedicated on-call team for each of their separate teams, and they rotate weekly, biweekly, or monthly.
So that's the other part.
The second part is to set the on-call roster, and I think then organizations take some time until the on-call rosters get settled and they start debugging tickets.
After some amount of time they go into setting up the incident response part, which is the post-mortems as well as figuring out, hey, these are the general steps we take to solve an issue, these are the workflows that we do. Now let's try to streamline this, both in the response and in the post-mortem analysis.
So that's how we have seen organizations go from step A to the last part, which is post-mortems.
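The weekly rotation described above can be sketched in a few lines of Python. This is only an illustration: the engineer names, the start date, and the `weekly_rotation` helper are made up for the example and do not reflect how any particular tool builds its rosters.

```python
from datetime import date, timedelta


def weekly_rotation(engineers, start, weeks):
    """Return (week_start, engineer) pairs, cycling through the roster.

    `engineers` is an ordered list of names (hypothetical), `start` is the
    first day of the first shift, and `weeks` is how many shifts to plan.
    """
    if not engineers:
        raise ValueError("roster needs at least one engineer")
    return [
        (start + timedelta(weeks=i), engineers[i % len(engineers)])
        for i in range(weeks)
    ]


# Hypothetical three-person team, planned four weeks ahead.
roster = weekly_rotation(["asha", "ben", "chen"], date(2024, 1, 1), 4)
for week_start, engineer in roster:
    print(week_start.isoformat(), engineer)
```

A biweekly or monthly rotation would only change the step between shifts; a per-team roster would simply call the same helper once per team.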
You said something really interesting, I think. I haven't worked at any company so far, and even at my own, Authress, here we have process. It isn't like we don't do anything when that happens. Maybe it's because we're a tech-focused or a software-focused company, and that's pretty much where I've worked. But the part that you said that was really interesting for me is that software engineers don't like on call, and, you know, I want to challenge that, or, like, I want to live in the world where it's not a problem.
Why do people not like it so much? Any thoughts about that?
Because people, as engineers or developers, don't consider it part of the building process. We love to build, we love to architect things, but once we have done that, we don't want to go in and fix just one tiny part of it which is actually causing the major issue, and probably just take the blame. People also have that kind of bias: hey, my product is bug-free, my product is super good.
And on top of going there and fixing this, you already have a lot of other work going on, the sprints in this agile world. So that's where people definitely don't want to spend a lot of time on it.
So that's what you think.
It's not, like, well prioritized or rewarded. If you do on-call work, you're not rewarded for it. If you write bugless code, you're not rewarded for it. So, you know, whatever, I don't want to do it. It's going to happen, and then I have to pay the fine because of it, and I don't get the benefit.
Yeah, I think benefits and all can probably be defined by the engineering managers or team leads, if they want to reward or highlight, you know, that a person has solved this many tickets and these incidents, or maybe find a better way of rewarding developers who actually solve a lot of incidents.
But yeah, I think in a general sense, it's not part of the building, it's only maintenance, and that's the major thing.
Like, no, no, I totally got it.
Yeah, I don't like on call because it's never my code, it's always something else.
Yeah, but I mean, that makes me think there's something... Yeah, no, I totally get it. I mean, I feel like there's something fundamentally broken there.
I've seen that at one of my previous jobs, where there were twenty to fifty engineers all rotating through the same on-call schedule, as if somehow, just because the code was all in the monolith, if something was broken, I would magically know what was going on in, say, I don't know, product or logistics code, when I had nothing to do with the development of it. It might as well be some foreign language, like, I don't know, Aramaic or cuneiform to me. I have no idea what was going on there at all, and yet somehow I have to come in and debug or find out what the problem was.
Right. Yeah, I think in Amazon this was the case. The entire detail page was kind of like that at that point of time, and that particular service, the page, is maintained by more than 150 software engineers. So most of the time you are debugging something that you have not written, and it's three in the night, and you're already frustrated, and you don't know what's happening. So that's where I think some bit of the frustration does come from. And that's why I think good processes can somehow mitigate some of the pain points, especially in such situations.
Was the expectation at three a.m. that software engineers should be able to log on, identify the problem, and push out a fix? That seems like something that would never have actually worked out in practice.
It did, sometimes. It kind of depends on what kind of processes you have in place. One part is to mitigate. Sometimes mitigation can be done just by reverting to the last running version. That's one mitigation people have. But if even that is not enough, you probably need to page the person who added that line of code and take help from him or her, or you maybe have a certain kind of team which can orchestrate and collaborate with a lot of different on-calls for different developers to mitigate and put up a patch or fix the issues.
Right, so this depends on how you have set up that incident response process itself.
Yeah, I think that's a really good distinction to make there: during an incident, oftentimes the primary goal is to mitigate the problem, which is different than solving the root cause of the problem. So if we're in an outage or an incident, we might mitigate the problem by launching, you know, fifteen more Kubernetes pods with just insane amounts of memory, just so we can ride through the problem until we're able to figure out the root cause, test that theory, and then deploy a fix for it.
So just so I got you right, Will, your strategy for incident management is turning it off and back on again.
Absolutely, three times. Always reboot three times.
Yeah, I think in live streaming, one of our major last resorts was to just keep the live stream on and maybe not have a paywall or something like that.
So that can be one. Even in different scenarios you can say, okay, even if fixing is taking time, immediately do something. There was a certain process you already had in place, so, as you said, just increase the Kubernetes pods so that at least your customer is not facing that issue for the moment, then use that time to actually fix the issue.
How do you decide what mitigation strategy makes the most sense? I feel like we're in a world now where we're going to automate whatever it is. So if we have some number of failures, do we just immediately start deploying extra pods? Do we immediately try to roll back to a previous code version? Can we even know up front what the right approach is to automate? Because the last thing I want, I feel like, is someone to get online and after half an hour be like, you know what, maybe we should deploy some pods with an insane amount of memory, that will solve all of our problems.
Yeah, yeah, it's interesting.
As an engineer, if you have some familiarity, you would know, this is the issue.
Let's say you're seeing a memory issue, then a pod increase makes sense.
But let's say it's something else, you're seeing a lot of 5xx errors, then a pod increase probably won't make sense.
There, reverting to a previous version probably makes sense.
So I think you need to have some bit of familiarity with whatever system you are handling. In case you don't, it becomes a huge challenge. Then you probably don't know what to do. So I think some familiarity is needed at least to get to that first mitigation.
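The rule of thumb from this exchange (memory pressure suggests adding pods, a spike in 5xx errors suggests rolling back) could be written down as a first-pass decision function. This is a rough sketch only: the `Signals` fields, the thresholds, and the choice to check errors before memory are all assumptions made for the example, and anything that matches neither pattern is handed to a human who knows the system.

```python
from dataclasses import dataclass


@dataclass
class Signals:
    """Hypothetical health signals sampled during an incident."""
    memory_utilization: float  # fraction of the memory limit in use, 0.0 to 1.0
    error_5xx_rate: float      # fraction of requests returning a 5xx status


def choose_mitigation(signals, memory_limit=0.9, error_limit=0.05):
    """Suggest a first mitigation; the thresholds are illustrative only."""
    # A burst of 5xx responses is unlikely to be fixed by more pods,
    # so reverting to the previous version is suggested first.
    if signals.error_5xx_rate >= error_limit:
        return "rollback"
    # Memory pressure without elevated errors: add capacity to buy time.
    if signals.memory_utilization >= memory_limit:
        return "scale_out"
    # Nothing matches a known pattern: escalate to someone familiar with it.
    return "escalate"


print(choose_mitigation(Signals(memory_utilization=0.95, error_5xx_rate=0.01)))
print(choose_mitigation(Signals(memory_utilization=0.40, error_5xx_rate=0.20)))
```

Either outcome is a mitigation rather than a fix; the root cause still has to be found afterwards.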
Still, I think that alludes to an entire skill set in software engineering of how to troubleshoot.
You know.
It's because like debugging when you're writing code is completely different I think than troubleshooting a live system and all of its different dependencies and trying to figure out where the potential problem might be and how do you how do you get some faith in that theory and then do something to mitigate it.
Yeah.
Yeah, I think there are a couple of techniques that we have seen and that we had deployed. One is to do a shadow on-call. Someone else is handling the on-call, but if they have a major incident, you shadow them. That way you know, these are the tools. I've seen that sometimes people don't even know, these are the dashboards that you can refer to, which will probably help you more. So reaching out to people and shadowing them definitely helps, especially whenever there's an incident or an issue in a system which you have not yet touched. So that is one. The other bit is that we had done Chaos Monkey a lot. Chaos Monkey is a concept that I think originated in Netflix engineering.
You have game days, where what you do is take certain infrastructure down.
Let's say you take down a replica of, let's say, Postgres, and see how your engineering team performs after that.
How much time are they taking to mitigate? What's their MTTD? How are they fixing it, how are they communicating, how are they collaborating? Are they putting the right communication to the right stakeholders at the right time?
Those kinds of events, those kinds of practices have helped, especially when you have not done it for a long time. So Chaos Monkey does help a lot, especially in prepping for large events, and especially in having a proper collaboration sync with other teams, because that's also what is needed. You do it within your team, but you also have to do it with, let's say, your front-end team or your DevOps team. You need to be in sync to mitigate an issue. So there are techniques to do that, so that the right information and the right education are there for whoever is coming on call.
I often wonder how much these things provide value. Like, where along the spectrum is the right time to start implementing, say, a game day where you're taking your own stuff down, or the Simian Army to inject faults into your architecture or infrastructure? I see a lot of companies where I could say, hey, you know what, that's probably not the highest-value thing. They have so many other problems that I think they should tackle first before they're ready to do that. But then on the opposite side, I'm thinking, wait, if they did this, they may actually identify critical problems within their infrastructure that could cause them multi-day or multi-week downtimes, which would have more catastrophic impacts in the long term. I don't know, is that interesting? Any thoughts on that?
Of course, a company at a startup stage or in its initial stages, where they maybe don't have a lot of customers, they won't be doing this. I'd say once you have started having multiple engineering teams, a kind of system where sometimes the information is scattered between teams and, when a fire is there, you don't know who to ask, like, who put that there, that's the beginning of it.
And as the teams start to mature, I think that's the right time to start these processes.
Yeah. I think maturity is really the key word there, because you have to have multiple layers of maturity. You have to have a product that's mature enough to be tested, but you also have to have maturity in your leadership team, where they recognize and understand the value of saying, hey, we're not shipping new features this week. We're not shipping shiny new buttons. We're actually going to take the time and effort to see what it takes to break our system.
Yeah, I think probably when a company is having a launch, launching new things or a new product, maybe a week before, people do dogfooding. They can add this incident response practice as a part of it, so that you know how your team would be reacting on day one or day two.
I think sometimes even the managers or the management start to realize, we are now spending a lot of time on these incidents themselves, and our delivery of other important stuff is also getting impacted, so now we should find time to set up some process for this, so that we get fewer of these incidents and have more time for our core features.
So it's always about finding that right balance.
Even engineering managers, I believe, have a tough time justifying spending a lot of time on these kinds of things.
I think that's probably the conundrum they are always in, which part to spend time on.
So, uh, they have to take...
I'm, like, stifling laughter here because I feel like I have so many previous traumatic experiences of some sort of on-call event. That's on one side. On the other side, you said, oh, well, the product manager needs to prioritize that. But I want to hire that PM who is actually like, you know what, our incidents are impacting our future profitability, so we should take a look at improving ourselves.
I've never heard that. I've never heard anyone on that side say that. It's always the other way, like, oh, we don't need to worry about that, it's done, right?
Didn't we finish that? Didn't we push it?
We don't.
We don't need to improve it anymore.
I think the first part is always about having the right report, or having some sort of information, so that you can say, these are the incidents, these are the repeated incidents, and, if you have some sort of business impact attached, show them the numbers. This is the impact, and if you want to reduce that business impact, then we need to do this. I think it's always hard to justify spending time on the incidents.
But if you have that data, that data would be of use.
I mean, this is where DORA is super successful, where we come in with mean time to resolution and change failure rate. Falling back on those statistics can be really helpful in the conversation to convince people that it isn't industry standard that every single pull request we push out results in a bug in production.
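The two statistics named here are simple to compute once incidents and deployments are recorded. The sketch below assumes hypothetical inputs, a list of (opened, resolved) timestamps and two deployment counts; it is not tied to any particular tracking tool, and the function names are made up for the example.

```python
from datetime import datetime


def mean_time_to_resolution_minutes(incidents):
    """Average minutes from opened to resolved over (opened, resolved) pairs."""
    if not incidents:
        raise ValueError("need at least one resolved incident")
    durations = [
        (resolved - opened).total_seconds() / 60
        for opened, resolved in incidents
    ]
    return sum(durations) / len(durations)


def change_failure_rate(deploys_total, deploys_failed):
    """Fraction of deployments that caused a failure in production."""
    if deploys_total <= 0:
        raise ValueError("need at least one deployment")
    return deploys_failed / deploys_total


# Hypothetical data: two incidents and one month of deployments.
incidents = [
    (datetime(2024, 1, 1, 3, 0), datetime(2024, 1, 1, 3, 30)),
    (datetime(2024, 1, 2, 10, 0), datetime(2024, 1, 2, 11, 30)),
]
print(mean_time_to_resolution_minutes(incidents))
print(change_failure_rate(deploys_total=40, deploys_failed=6))
```

Tracked over time, numbers like these give the business-impact argument something concrete to point at.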
That's right, right right.
I would imagine that most companies' onboarding experience to incident response is a result of hitting a breaking point where they've had just outage after outage after outage, and finally they're like, okay, we have to do something different, which is probably what leads them to you. Would that be a fair statement?
Yeah, yeah, I think one part is what you say, having outage after outage. The other part is, let's say they want to streamline some process. Usually they see that maybe the on-call is confused about what to do, or the on-call needs to react, or the manager doesn't know what's happening, or someone doesn't even know how to report an incident, for example. So there are different aspects to it. The entire incident response is made of two parts. One is the trigger: how are you creating the incident, what's the trigger for that? And then, how are you responding to that, which is debugging, communicating, and then the post-mortem.
So that's where we try to come in. You can streamline the entire pipeline, make it as quick as possible, make it visible across stakeholders, across support, across engineering teams, and have the post-mortem analysis processes in place. So I think we come in when teams recognize too many repeated incidents or too much of this stuff, and whoever is on call is feeling very confused about the state of things. So that's where we have seen a lot of traction.
Yeah, you mentioned the stakeholders there, and I think that's a really cool thing to dive into for a minute, because communication is one of the key things of incident response, and it's the one I always hated the most early on in my career, because I would be in an incident and then everyone wants to know what's going on. Well, I'm working on it, damn it. But I can't work on it if you're sitting here hounding me with questions.
And so I think a key part of a solid incident response plan is having a communication plan so that you can relay that information out and free up the people actually working on the incident to continue working on it. How do you recommend addressing that?
Like I say, an on-call is a person who's always on fire, who has to mitigate the issue.
I think that's the main thing, but because of the environment, he needs to do a lot of other things as well: communicate to support, you know, what's the estimated time of resolution, and communicate to the manager, hey, this is probably the impact, this many users, this many subscriptions are being impacted right now. So that's a major pain point the on-call person has. The best way is to delegate a lot of this stuff, or to have a system which is visible to the stakeholders so that they don't ping the on-call or ask them again and again.
One of the ways we do it via Pagerly is that we live in Slack as a major part of the incident response. Let's say we create a Slack channel for each incident, and in the Slack channel you can see what's the ETA and what's the business impact.
Some bit of that information is added by the on-call, but nobody is asking the on-call again and again, you know, what's the ETA, what's the impact. Let's say a manager wants to see it, they can go to the channel and see it.
Customer support can see it too. Let's say someone wants to send an email, they can just send all that information by email to whatever stakeholders the company has. So whatever actions the on-call has to do apart from mitigation are an additional effort, and whatever tools and resources they can utilize to delegate and automate would be much more helpful, so that his major focus is always on mitigating the issue as quickly as possible.
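The idea of one visible place that answers "what's the ETA, what's the impact" can be illustrated with a small structure that renders a summary suitable for pinning in an incident channel. This is a plain-Python sketch: the `IncidentStatus` class and its field names are invented for the example, it does not call Slack or any other API, and it is not a description of how Pagerly works internally.

```python
from dataclasses import dataclass


@dataclass
class IncidentStatus:
    """Hypothetical summary that stakeholders can read without pinging on-call."""
    title: str
    eta: str
    impact: str
    next_update: str

    def render(self) -> str:
        # One line per question people usually ask the on-call engineer.
        return "\n".join([
            f"Incident: {self.title}",
            f"ETA: {self.eta}",
            f"Business impact: {self.impact}",
            f"Next update: {self.next_update}",
        ])


status = IncidentStatus(
    title="Checkout errors",
    eta="30 minutes",
    impact="about 2% of subscribers cannot renew",
    next_update="in 15 minutes",
)
print(status.render())
```

The same rendered text could be reused for an email to stakeholders, so the on-call engineer writes the update once.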
Yeah, I mean, I think having those additional things in place, once you identify them, to help streamline the process is super important. Like, we've got status dashboards that we can point customers to immediately, so rather than trying to explain where the updates are going to be or how they're going to happen, you just go to the URL and stuff is there. But I mean, I think also, as customers of SaaS solutions, we have an opportunity to even be nicer to companies that are having incidents. I mean, I think there's an emoji dedicated to this, hugops, right? You know, when something's happening, pass on the empathy a little bit.
As a customer, I care way more that you tell me that you know there's a problem and that someone's looking at it, rather than being like, oh, we don't know, I don't know what's going on. Or, you know, even if someone's looking at it, I much prefer to be told, oh, yeah, we'll have an update in an hour, than, oh, this is exactly what's happening at this moment. I don't care about that.
I want to know when the next update is going to be happening, more so than the play-by-play.
Right. There are always two types of communication, internal and external.
Both have to be there, I think the external one certainly more, because there you have the ultimate stakeholders. But both need to be always updated.
Both need to be always to the point, because if there's any miscommunication happening in either of them, it will just prolong the incident much more.
So it reminds me of the AWS status pages. Early in the days of AWS, it was always green. I would have put money that it was just a green icon there and there were no other options available, because it was always green.
Right. I remember at that time I usually didn't have a lot of confidence in that. Downdetector or some other site, or even Twitter, was a much better way to know there's actually a major outage, and those status pages were not, at all.
I think things have changed.
But you bring up Twitter, though, and that's a really good point. I mean, I think for a lot of tech-oriented companies that's a primary communication channel, you know, sending out notifications on Twitter or X and relaying information that way. And also, it's kind of sad to say, but that's also a good notification method for whenever your customers think something's going wrong.
I mean, enough that I saw some products that specifically go around to social media and get real-time status from potential users complaining, because it's another source that you're not tapping into to actually let you know if customers have a problem. You know, they're not necessarily reporting it back to you.
This is the report mechanism, right.
I think these two kind of work.
That's a good point.
Yeah, and I think companies have started to put out RSS feeds also, for a long time now, and they have integrations from those feeds to their Twitter accounts, or maybe their community Discord if they're doing a SaaS kind of product or something like that, so that their customers are also updated through these platforms.
I mean, accuracy, though, is what you're bringing up, Will. And I feel like there's a huge challenge there, realistically: what makes sense to even talk about, and what should be intermittent hidden failures from an internal company standpoint? Like, I don't want to see Amazon just being red all the time because some node in CloudFront failed one request because the connection didn't go through.
Like how does that help?
Or yellow all the time, because there's always something that's possibly a problem. I think a single color there is always wrong.
Right, and and that's why I didn't.
I think if you look at AWS, over the period of time they have evolved their status page. Now they actually have it region-wise. I think for some of the services they have also started to do more granular scoping, as much as possible. That's even true for Slack. Earlier they used to do it only for messages, you know, whether your workspaces are working fine.
They have now started to do it for APIs, and even login is separate, so they have started to break out every bit, so that you don't have a yellow for maybe just a small issue in a small service in a small region. So if that status page is more detailed, then I think it probably helps to give the right information.
I mean, I actually think AWS went a lot further here. They have something called the Health Dashboard, which figures out what services you actually use, I think, and how they could be impacted, and then actually has messages there. Which, I mean, is really what we all actually care about, right? Is there something happening at this moment which actually affects us? That could be interesting, realistically: if we saw a problem, does this explain it?
Right? Right? Absolutely?
So, one thing we haven't talked about a lot is the postmortem, and I feel like that is as much work as doing the incident response itself, but sometimes it gets overlooked because it's no longer a priority. Like, once the incident is no longer an incident, you have to just be disciplined enough to run through the postmortem process. How do you approach that?
I think the postmortem is, I'll say, like a chore on top of the work you already have.
You know, you've been sleepless, right? You've gone through the incident and maybe fixed it also, but now you need to... Like at Disney, we used to have a one-day maximum SLA to come up with that postmortem document, because they were very bullish on that: we want to know what caused the issue, so that we can fix it by tomorrow itself. So I would say that's where the gap is. That's where a lot of people drop off; they don't want to do that work, especially after a grueling incident-resolving process.
But I think it's more of an education part, or more of realizing what you have learned from your incident resolution. A team has probably resolved a lot of incidents, but if they have not learned anything from them, then it's pretty much wasteful, because tomorrow a similar or probably the same incident would occur, probably on a different team or on your team itself. So I think the value of the postmortem needs to be told pretty clearly, and it's a very clear proposition.
I always feel like you can tell the engineer, hey, what we don't want is you spending this much time again on a similar thing next week. That's where the postmortem can help. So I think that value is important.
Yeah, and I think that's one of the really big values of using an incident response tool: it will collect all of those data points and help you more easily see that you're having this common failure.
Yeah, over and over again.
That otherwise, if you're just tracking this in like Google docs or whatever, you wouldn't actually see that correlation.
Yeah, I think it needs to start with what information you are filling in. Usually what we help with is to do the five whys. First of all, what happened? Then, what went well? What could have gone better? What can we do to remediate in future? So having that information in place is pretty much the first step.
Then, do you have the right way to analyze things? The timeline is part of that. If you want to see the Slack conversations, or what happened from when the incident was triggered until it was resolved, that timeline also helps you a lot, so that you know where a lot of time was being spent, or if there's a gap in the communication process. That is also visible from there, and so are the tickets or the action items that you have created out of it. So there is a lot of information in that postmortem document that can help you analyze a lot of things.
And most of the time, what we have seen is that it's usually a communication error. Let's say you didn't tell the team, hey, this API version is deprecated, so you need to update. Things like that are the most common issues. So then you set up a process around that too, so that next time, if you deprecate a version, that notification is sent to the different teams. But with every bit of this, you get these results, these jewels, only after you have that document which has that entire information in place.
And you can add those action items as well, maybe a short-term action item, a long-term action item, and that really helps. And the other part is to follow this up.
You create these documents, you probably have meetings also, but what after that? We need to follow those action items to the last brick, until those tickets are closed. You need to follow that up, because otherwise this entire process becomes useless. So following that part to the very end is also pretty important.
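The timeline point above (seeing where the time went between trigger and resolution) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the timestamps, event names, and the `phase_durations` helper are all made up for the example and are not taken from any tool discussed in the episode.

```python
from datetime import datetime

# Hypothetical incident timeline: (timestamp, event) pairs, oldest first.
timeline = [
    ("2024-06-01 10:00", "alert triggered"),
    ("2024-06-01 10:05", "on-call acknowledged"),
    ("2024-06-01 10:50", "second team paged"),
    ("2024-06-01 11:00", "fix deployed"),
    ("2024-06-01 11:10", "incident resolved"),
]


def phase_durations(events):
    """Return (from_event, to_event, minutes) for each consecutive pair."""
    parsed = [(datetime.strptime(ts, "%Y-%m-%d %H:%M"), name) for ts, name in events]
    return [
        (a_name, b_name, int((b_ts - a_ts).total_seconds() // 60))
        for (a_ts, a_name), (b_ts, b_name) in zip(parsed, parsed[1:])
    ]


durations = phase_durations(timeline)
# The longest gap is a candidate for a communication or escalation problem.
slowest = max(durations, key=lambda d: d[2])

for start, end, minutes in durations:
    print(f"{start} -> {end}: {minutes} min")
print("Longest gap:", slowest)
```

In this made-up timeline, most of the time sits between the acknowledgement and the second team being paged, which is the kind of communication gap the postmortem is meant to surface.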
I heard a spicy take recently, and I want to I want to lay this on you. Every incident could have been prevented if you just had the right test.
I think, I'll say, in most of the cases that we have seen it is usually the communication part. That's the most common thing. Like, I remember a case... Terraform has a lot of issues. Everyone kind of has a different story to it.
No argument here.
So one time, what we saw was, if we updated the security groups via Terraform, what it was doing was removing the security groups first. Let's say I want to add a security group: it removes the existing security groups first, and then it adds the whole list, even though I'm adding just one extra. Now, what happened was, in the meantime, while it was removing those security groups, we were down. The service was down for, let's say, two or three minutes.
Now, this is something that happened to our team, and that's potentially a ticking bomb, which can happen any time across any teams. So even if we just communicate, hey, this is what we have seen, this is what we experienced, and just relay that to all the engineering teams, that issue would not occur. So that's usually the case, as we have seen. If just the communication is proper, I'd say a lot of incidents wouldn't occur.
I still can't get over the fact that Terraform does that by default. It seems like something that no one in their right mind would have designed, to have the default be: first delete all of the resources and then recreate them. I mean, that just seems backwards. Isn't the common wisdom in operations, okay, first we'll create the new thing, make sure that it works, and then switch over to it? Why is the default delete?
I don't know.
Maybe I just need enough coffee or something, and the insight will just magically come to me.
Oh, maybe that's an AWS thing. We don't know.
I don't remember. No, because with CloudFormation and CDK and everything like that, it's not. It's just the order in which the SDK is being executed. There's no fundamental reason why it has to be that way.
I want to come back to your spicy take, Warren. Every outage could be eliminated or avoided with the right test. I mean, I think in theory that's true, but the practical steps of executing that make it not the right answer for a lot of people. But I think it does highlight something that I don't think I've ever talked about in terms of incident response with anyone before,
and that's identifying what your risk tolerance is. Because for a lot of companies, having some downtime is really not a big deal. In other companies, it is a big deal. Like I worked for a while in a medical company where downtime for us meant that patients could potentially die, So we were kind of risk averse there. But in other places, you know, I worked for a company that
built a fitness app. You know, if we were down, somebody had to figure out how to use the treadmill on their own, I think they're going to be okay, But.
Like in those.
Yeah, but in those two extremes, you know, there's like there's a different risk tolerance for how much downtime you're willing to take. And I think that is probably something that maybe needs to be talked about more by companies when deciding how much downtime we want, Yeah, is down every weekend?
Right, I think with downtime it also brings up a point about alert fatigue or on-call fatigue. People have very low thresholds for a lot of things, and over time they realize, probably we don't need such low thresholds. I think alert fatigue generally happens when teams are starting to put their on-call process in place: they put alerts on a bunch of things, and over time they realize, we don't need this alert, or we can raise the thresholds. At Pagerly we also provide this: you can annotate alerts to analyze and probably reduce some of the alerts that you don't even need, or you can increase thresholds.
So similarly for incidents, you can define or change those values over the period of time.
Some companies can afford to have incident response only during, let's say, business hours; they can afford not to do it during weekends or nights. But some companies can't afford even a few minutes. So it absolutely depends on the company, the type of product, or the type of service.
Yeah, I mean, I think it's the same thing with alerts from a security standpoint, which in my domain we talk about a lot: how much do you want? How much is important? How much is relevant for you?
And maybe Pagerly and others help you actually identify after the fact how much you have. And at that point the ROI is super critical to evaluate, because trying to duplicate production in a way to test what happens at that scale at that moment... there's no way with cloud providers to practice what being capacity constrained looks like. And if you're capacity constrained because there isn't another bare-metal device available, there's no alternative. Oh well, you know, we should be multi-cloud. Like, that is never the answer.
I mean, you can have backups also, you can have as much as possible, but there's no single answer.
Yeah, there are some things, like actually pick what your SLO is going to be, what your objective is going to be for uptime or incidents, and then make sure your strategy actually includes that and handles it, and then measure it based off the number of incidents you get, rather than saying, oh yeah, we should know when the memory goes above ninety, because then it's bad, apparently.
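As a rough illustration of "pick your SLO and measure against it", here is a minimal Python sketch of turning an availability target into a downtime budget for a window. The 99.9% target, the 30-day window, and the incident durations are assumptions chosen for the example, not numbers from the conversation, and the function names are hypothetical.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Allowed downtime in minutes for an availability SLO over a window."""
    total_minutes = window_days * 24 * 60
    return total_minutes * (1.0 - slo_target)


def budget_remaining(slo_target: float, downtime_minutes, window_days: int = 30) -> float:
    """Budget left after subtracting the downtime of each incident in the window."""
    return error_budget_minutes(slo_target, window_days) - sum(downtime_minutes)


# Assumed example: a 99.9% SLO over 30 days, with two incidents of 12 and 20 minutes.
budget = error_budget_minutes(0.999)
remaining = budget_remaining(0.999, [12, 20])

print(f"Budget: {budget:.1f} min, remaining: {remaining:.1f} min")
```

The point of the sketch is the framing: the question becomes "how much of the budget did incidents consume", not "did a resource metric cross a line".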
Yeah, I mean, it always gets updated with time, as your product is growing and your customers are growing.
What are your top tips for someone who is not satisfied with their current incident response program or software?
I would say you can always see different parts to it. The first part is how easily your team or anyone is able to report the incident. So yes, you have automated alerts from your monitoring tools and all those parts, but not everything can be automated.
You need to have a correct way of reporting issues.
So if you have customer support, or a product team, or let's say you have dev environments or pre-prod environments where you want to report issues, you need to have a good way of reporting that, and a process so that the issue is reported to the correct team as quickly as possible. So the time to trigger the incident, or the time to page the on-call, should be as short as possible. So identify the blockages in that; there can be blockages in setting up this process as well.
And the other part is, once the incident has been triggered or created, what to do?
If the on-calls feel like there's a lot of work outside mitigation, if every time you have an incident you're just running around calling people, figuring out what the status is, or adding other teams' on-calls all the time, figure out these kinds of blockages in your processes and try to streamline as much as possible, so that the on-call, or whoever the other stakeholders are, can focus on solving as much as possible.
If that is still taking a lot of time, maybe set up a team or a dedicated person who is actually handling all the incidents, dispatching the incidents to the correct team, and supervising the entire process.
And the last one is to see whether you are doing the postmortems correctly. Are you doing them at all or not? Are you actually learning from your incidents? Have your repeated incidents reduced over time or not? I'd say that's the most important one. A lot of companies focus on MTTR. I think that's probably not the right metric. The right metric is to see how many unique incidents you're getting. If that is fine, that's fine. But if you're getting repeated incidents time after time, something you could have avoided, then the problem in your incident response process is the postmortem process.
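The "unique versus repeated incidents" idea can be sketched as a simple count by cause. This is a hedged illustration only: the cause labels and the `repeat_summary` helper are hypothetical, and in practice the hard part is tagging incidents consistently enough for such a count to mean anything.

```python
from collections import Counter

# Hypothetical incident log for one quarter, one cause label per incident.
incidents = [
    "db-connection-pool-exhausted",
    "expired-tls-certificate",
    "db-connection-pool-exhausted",
    "deprecated-api-version",
    "db-connection-pool-exhausted",
    "expired-tls-certificate",
]


def repeat_summary(causes):
    """Return (unique causes, repeat occurrences, most frequent cause with count)."""
    counts = Counter(causes)
    unique = len(counts)
    # Every occurrence after the first one of a cause counts as a repeat.
    repeats = sum(n - 1 for n in counts.values())
    return unique, repeats, counts.most_common(1)[0]


unique, repeats, worst = repeat_summary(incidents)
print(f"{unique} unique causes, {repeats} repeats, worst offender: {worst}")
```

A repeat count that stays high quarter after quarter suggests the postmortem action items are not being followed through.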
I think that that's interesting.
That's a good point.
I mean, I really wonder, are people hitting the same incident over and over again? My guess would be probably not exactly, but maybe correct categorization would really help spot it. Like, you know, is it the same part of your framework, codebase, or component? You know, even if you have a monolith and not microservices, you still have broken-out components. You can at least target it down to: is it the same component that's causing the problem
all the time? That gives you a place to look and invest in, rather than, oh, you know, it's just something happening within our whole system.
Right. Did the on-call have the runbook when they were solving the incident? Did they have the right tools? Maybe the particular service doesn't even have a dashboard, or doesn't have any playbook which can help the on-call. So there are, I think, a lot of learnings which
any engineering team can take. And I think, at the end, it's all about how much we can help the on-calls.
So yeah, I mean, if we're seeing the same incident over and over again, we should at least be able to brag about our mean time to resolution decreasing, because everybody instantly knows which service to restart.
But maybe that's a good point though, right maybe that maybe that's the whole point.
Right, like you don't want to have that actually decreasing, because then, you know, it really points to a different problem. It's like, if you have runbooks, it must be because you hit the same problem over and over again. And so rather than having the runbook, it'd be better to eliminate the source of the problem.
And that's why I didn't.
I think it's always debated how good the MTTR metric is and how much you actually want to trust it, because it's usually counterintuitive. As you rightly said, let's say a company has resolved most of its incidents, resolved them to the correct point, over six months. In the seventh month, the engineering team doesn't have the same incidents or similar incidents, but they have only one new incident, which took a long time because it was a unique case. So in that case the MTTR is too big, because a new incident came with a lower frequency, a lower number of times, but it took a longer time. But can we say that the engineering team had a bad state of incident hygiene? No, because they had resolved most of the incidents that occurred in the past, and those are not occurring now. This is a new one. So that's why I think MTTR is not always the right metric to look at in terms of incident hygiene.
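The seventh-month scenario above can be worked through with made-up numbers. In this small Python sketch the resolution times are assumptions chosen only to show the effect: a month with several quick, recurring incidents next to a month with a single long, novel one.

```python
from statistics import mean

# Hypothetical resolution times in minutes.
# Month 6: four recurring incidents, each resolved quickly from a runbook.
# Month 7: the recurring ones are gone, one novel incident takes four hours.
resolution_minutes = {
    "month 6": [20, 25, 30, 25],
    "month 7": [240],
}

mttr = {month: mean(times) for month, times in resolution_minutes.items()}
incident_count = {month: len(times) for month, times in resolution_minutes.items()}

for month in resolution_minutes:
    print(f"{month}: {incident_count[month]} incidents, MTTR {mttr[month]} min")
```

With these numbers MTTR goes from 25 to 240 minutes while the incident count drops from four to one, so MTTR alone would suggest the team got worse in exactly the month its hygiene improved.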
Yeah, I think the error budget, according to your SLO, is a much better one in this regard, unfortunately. Yeah, but I mean, I think that all the DORA metrics sort of have that problem in a way if you measure them purely, or just people in general, rather than how they're actually relevant. Like, I remember working with one company that was measuring cycle time, but they were using feature flags and not
including that in the cycle time. So I'm like, yeah, your code is going to production, but no one's using it, so what's the point?
What's the point?
I mean, yeah, I mean measure then also measure the cycle like the cycle time on feature flag removal. That's going to tell you a lot more about your success.
Right, I think, you know, we have seen a lot of tools that maybe just measure how many commits you have pushed. So I think everything has to be read with a lot of context, because it can change based on the different kinds of things that are happening.
Yeah, I know we're getting close to the limit on the time that you've got with us today. Uh maybe there's some last words and then we can move over to picks. Anything you want to share.
No, I think this was pretty cool.
I think with Pagerly also we have seen a lot of different and unique cases, and it's good. This is something which is very close to what we have seen, what we have felt, and we'd like to help companies streamline this entire process as much as possible.
It's got to be super interesting, too, to see what companies' incidents actually look like.
Oh, for sure. Yeah, I think what we have realized is that every company has different processes, probably because of the state of their product, the state of their org size. And what we have always ensured is, whatever your process is, whatever you feel is the most apt, we will not force a tool on you; we will adapt to your kind of processes. We'll just make it more automated, because we know you have already set up the big pieces of it. So there's no one particular way of doing things, but whatever it may be, we'll fit.
Well said, well said. So what do you think, Will, should we do the picks?
Let's do some picks.
Okay, I know you put me on the spot anyway, so I'll just go first.
My pick for today's session is a book called Radical Focus by, I think it's Christina Wodtke. It's actually fantastic. It's a hypothetical story about how to actually set priorities using OKRs or KPIs or whatever, MBI, whatever you want to call them, honestly, and how not to do it, and lessons learned from that. It's super relevant no matter what level you're at, realistically. Even at the team level, it's super interesting to think about, like, how many priorities
and what should our focus be on, and how to think about that. Because I've seen so many teams, so many companies have, like, oh yeah, we have ten initiatives for this quarter, and I'm like, you can't. I bet your engineers couldn't even tell you five of them. Like, it's just too many. And I think it's a great story about how to actually think about this and what's relevant.
So highly recommended.
Dude, your picks are always so relevant. I feel like mine are just like better, but yours are like, oh wow, that could actually work and be helpful.
I mean, I don't know, maybe I'm being lazy by picking easy things. Well, you know, when I'm in year two as a host here, maybe I'll have run out of
things, and then I'll be on to, I don't know, my weights that I've got in my other room that I'm using.
All right, Falit, what did you bring for us for a pick today?
I think I have one.
I've seen a documentary recently about how Toyota builds stuff, and a lot of things were very interesting. I'm forgetting what they call it, but essentially, Toyota is known for, you know, building bug-free products.
Manufacturing, yeah, for sure.
And one of the things that they have always had is, no matter where their manufacturing units are, if there's an issue, it gets reported to the topmost tier immediately, with proper clarity. And that's how, I think, communication becomes so important, and that solves a lot of things.
So I think it's just very fascinating how...
They actually have these cords on the manufacturing floor, called andon lines, that help. Yeah, they stop the whole manufacturing line at once. You know, hey, no more pull requests at this moment for the whole company, because there's something critically wrong going on. Like, could you imagine?
Yeah, yeah, I think the andon line was what I was referring to. That's super...
I think that's such a simple technique that any company can have. It's such a simple process, but very...
Yeah. One of my customers was a company that provided seats to Toyota, and it was wild because they would get orders like Okay, we need three hundred and seventeen seats that are beige delivered at ten twenty seven am. You know, like the level of specificity because they have that that just in time manufacturing, like we need these at ten twenty seven am. And and this was a smaller company, so like the level of pressure for them to meet those requirements was just through the roof.
I mean, it makes a lot of sense too if you think about it, because they see inventory storage as a waste, as a cost to them, and so they don't want to have it stored. You're the inventory for them. You know, they're going to the shelf and they're pulling it, and you are that shelf for them.
Yeah. No, it's awesome.
Yeah, I feel like, you know, we were talking about this before we started recording, about how my picks are just kind of out there. I feel like this one, on the crazy scale, is going to be hard to top. And yeah, I'll
just get to it. So I read this book. I just finished it up a couple of days ago, called The Sacred Mushroom and the Cross by a guy named John Marco Allegro, and I would be tempted to call bullshit on the book right away, except for the fact that this guy spent fifteen years deciphering the Dead Sea Scrolls.
And so if you're not familiar with the Dead Sea Scrolls, it's a set of scrolls that were found in Egypt, I believe in the nineteen forties that were thousands of years old, and they contained some parts of the New Testament Bible, but they also had other stories in there as well that weren't included in there, and so he
deciphered them. But this book, The Sacred Mushroom and the Cross, he basically goes through this book showing or arguing that a lot of the stuff written in the Old Testament and the New Testament and some other religious books as well was not fact-based, but was actually a play on words referencing psychedelic mushrooms, and that the whole religion is based on an ancient cult or ancient
culture that worshiped psychedelic mushrooms. And it's a wild read. Man, It's very hard to read because of all the references he makes to like the Aramaic and the Semitic languages. But the big takeaway for me was, you know, you're reading through this and he's like, oh, so, well, they said this thing and that's actually the you know, the ancient Sumerian word for this psychedelic mushroom, and like everything points back to being the name of a psychedelic mushroom.
And I was like, dude, how is it that we know so little about the ancient Sumerians, but you know the four hundred different words they had for psychedelic mushrooms? But then at the end of this book there's like a chapter. I can't figure out who wrote this last chapter, because it wasn't John Allegro, it was someone else. But the guy was talking with his wife, she was from Russia, and they came across a field with a bunch of mushrooms, you know, and he was an American and he's like, no, don't eat those, they're all poisonous and stuff.
And she's like, no, this one is this, this, and this. And so it turns out in Russian they have like an endless number of words for mushroom, but in the US and in Western cultures, we have, you know, like three: toadstools, mushrooms, and whatever else.
So it's a yeah, that's a European thing because people actually go pick out pick mushrooms here, and so knowing which ones are poisonous the same thing.
You know, once I've moved here, I learned all about that.
But I think you've memed yourself well, because now I see, instead of aliens, now it's just mushrooms.
Right, aliens? The last one, well, no, it's the the guy from.
The Ancient Aliens TV show with the big hair. If you've ever seen that meme.
It was the History Channel, there was Ancient Aliens, and yeah, it's like the pyramids are landing platforms for aliens, and you know, Will's here, you know, trying to
Sell us on the fact that the religious cult is of mushrooms.
I mean, yeah, good reading.
I added to my list, So thank you for that.
Yeah, let me know when you get through it. I'd be interested to talk through that with you, because there are some parts where you're like, okay, I see how you can get to that conclusion, and there are other parts where you're like, come on.
Is it a pretty famous kind of book? Was it written recently, I think?
It was published in nineteen seventy, so it's an older book. And then he got a lot of hatred, and supposedly it was very detrimental to his career. Go figure, who would have thought that, you know, claiming Jesus was a psychedelic mushroom
would be detrimental to your career. But anyway, I think it's gained in popularity over the last couple of years just because of the shift in the things that we're seeing, at least here in the Western world, where people are kind of changing their opinion and approach to things like natural medicines, like psychedelic mushrooms, and, you know, the legalization of pot. And now in Oregon and Colorado there are actually decriminalized centers for using mushrooms to treat things like PTSD
and memory issues and things like that.
You're really close to Canada where that's been a huge topic in the last years.
Yeah.
Yeah, so I think that's been a key to the book gaining new popularity. Yeah, all right, so there you go. So now the challenge is on. Next week, what am I going to come up with for a pick that tops Jesus as a mushroom?
I actually saw one movie which is on my mind, Inside Out 2. I saw it, I think, last night, and I highly recommend it. I think you can feel a lot of emotions. That is something which is just on my mind.
And what was the name of that one?
Inside Out, the second part.
Yeah, awesome. And with that done, I think we have an episode. Thanks for joining us, man. This has been a blast. Really appreciate having you on the show.
Great, thank you.
Yeah, I think it's been really great. I had, you know, a wonderful time just chatting about incidents and a lot of other things, and sharing each other's personal experiences. I think every on-call, or even every developer, has their own personal experience that they want to share.
So it's been a really good chat.
Awesome, cool. Thank you again, and to all the listeners, thank you for listening. Appreciate y'all, and be sure and hit us up if there's anything we can do for you, and we'll see y'all next week.
