Bridging Processes and Improving Incident Management - DevOps 187 | Adventures in DevOps podcast

00:14

Hey, what's going on, everybody? I am the host of Adventures in DevOps and, wait, yeah, that's the right channel name. Sorry. You know, sometimes I get that mixed up, like almost every time I think, so if you're a frequent listener to the podcast, you know that I mess that up just about every time. So thanks for bearing with me on that. But today I'm not going to mess this part up, because today I have as our guest Drew Stokes. He's the senior manager of software engineering for

00:44

PagerDuty, and we're talking about incident management. And that's one of my favorite topics because it just goes so deep and it crosses so many disciplines across your infrastructure and your application teams and marketing and sales and the executive suite. Like, depending on the level of the incident, you're just all over the board with this. So Drew, welcome to the show. I'm excited to have you here. Thanks, I'm excited to be here.

01:11

It's going to be good. Right on. So tell us a little bit about how you got into the field of incident management or incident response. Oh, that's

01:23

a good question. Okay, yeah. So I've been in tech for a while, like most people here. I think it's been something like sixteen years, and I think originally I was kind of trying to figure out my way helping folks out with technology and networks, and then I got into front end development and moved into back end and then dropped into SRE, and that's when I kind of really got familiar with not just the process of mitigating incidents, but

01:52

actually managing them and trying to learn from them. So I did that for a while, and then I think for something like the last eight years, I've been primarily focused on a people manager role. And there are a lot of ways in which, you know, people managers are involved in incident management as well, both as stakeholders but also, you know, facilitators and folks who are playing a

02:13

supportive role for people who are responding. So I've kind of been in that space for a while now, and back in, I think it was May of 2021, I joined a startup called Jeli, which was founded by Nora Jones, who's the author of the Chaos Engineering book and the founder of the Learning from Incidents community, and that was kind of where I really dropped into, you know, incident management in general, but specifically this opportunity to kind of

02:40

not just resolve incidents, mitigate the issues, but also to learn from them in order to improve future response and organizational performance. So there's a lot of really interesting ways to think about the space, and you mentioned at the beginning

02:53

it's really important. And part of the reason is because it's so cross cutting, right, because incidents are a lens through which you can see the way that your organization and your people operate, and that applies to customer service, that applies to executives and to the folks actually responding to the incidents. It's a really interesting space with a lot of opportunity, and you'll hear that word a lot in this conversation. We refer to incidents as opportunities.

03:20

Oh for sure they are because you know, one of the things that I think about a lot is just because we're in tech. You know, we've all done the Google search for is such and such service down because you're having problems and you're like, did I do something wrong? Or are they actually

03:36

dead in the water right now? And I think, to me, that's one of the hallmarks that your incident response plan is really, really well done: whenever your customers know that you're having an incident because you told them, versus them discovering that something was broken. Yeah, there's a level of... well, so one interesting aspect here is you mentioned another cross cutting function there, right, which is you have internal stakeholders and

04:10

external stakeholders for these types of things. But there's also this layer I think that you're referring to here of, like, operational excellence and observability. Right? Do you know that the system's broken before someone tells you that the system is broken? And a lot of the ways in which you can improve that process is

04:28

through the learning process after the incident. Right, So, if you have an incident, for example, where a customer reports an issue, looking at the details of that timeline and what actually happened can help you figure out where you need to add additional instrumentation or alerting, or how to adjust your team's processes, you know, your software development life cycle or your release process to

04:50

better account for those kind of unpredictable behaviors in the system. So it's really interesting, you know, when you're dealing with not just complex software systems, but also complex organizations and groups of people, right, really interesting opportunities to figure out how do we kind of iteratively improve our understanding of the system and our understanding of failure modes so that we can kind of inspire

05:15

customer confidence and trust, right, letting them know that there's an issue before they do. Yeah, for sure. So early on in this you mentioned something I want to highlight: mitigating an incident versus managing an incident. Can you elaborate on the difference between those two? Yeah, that's a great question. So there are a lot of different aspects of incident management in general, and I'll try to, like, decompose them in a way

05:46

that makes sense here. So I think you just referenced detection, right? So there's a phase there of, like, understanding whether or not there's an incident and trying to do something about it. And I think when we talk about managing incidents, what we're talking about is providing information and coordinating folks in incident response. Right? Mitigating an incident is doing something to address the issue and get the system back to a stable state or, you know, performing

06:15

in a way that's expected with regard to external stakeholders. But I think for us, managing an incident is really about investigating what's going on, getting the necessary folks with the subject matter expertise into the room to contribute to that, and coordinating that group of people in, you know, large organizations or really complex incidents.

06:36

Sometimes you have multiple work streams of investigation within an incident, and then communicating status out to stakeholders, your customer success team, your executives in a way that allows them to stay informed but does not have them jump in and start, you know, trying to get involved in the process in a way

06:54

that can, you know, add additional complexity to the overall incident management. So from my perspective, I think management is a lot more about the process of coordinating and communicating during an incident, and mitigation is about that moment when you've kind of identified and addressed the issue to stop whatever impact is associated with the incident. Right? That's your signal to your external stakeholders that we are in a stable state, we've seen things are good, and there are various other

07:25

steps after that, but for me, that's the primary difference. Yeah, yeah, I think that's really important for someone who's not done a lot of incident response to understand: that the management of it is just as important as the mitigating of it. And in many of the environments I've worked in, those are actually two key roles for any incident. You have the first responder who's trying

07:56

to find the cause and restore the service. But then you also have your primary communications individual who is getting the information from that first responder and relaying it out, and doing so in a way that everyone feels like they're in touch with what's going on and they aren't going around the back door sending DMs to the first responder to get status updates. Yeah. Yeah.

08:24

One thing we talk a lot about is kind of this incident management maturity model, and we think about different buckets of, you know, engineering teams or organizations with regard to kind of how they approach this. And I've been in, you know, multiple layers of the lower maturity model, and sometimes it can be really difficult to even understand who's doing what and who do

08:46

I ask for an update? You know, I've got a customer who needs an update now, and we have an SLA in the contract, what's going on? It can be really difficult to even know who's doing that. And I think you find that in, you know, incident response tooling like Jeli,

09:00

those roles are actually codified in the process. You're assigning an incident commander, you're assigning a communications lead to try and take care of that external communication of here's the person to you know, connect with if you need an update, or here's the person responsible for managing this incident, so that if you join in, you can say, hey, I'm here and I know about X, you know, can I help that sort of thing? Right? And

09:24

so that's one of the things that Jeli does for you. If you need to improve the maturity of your incident response plan, using something like Jeli can kind of help you say, hey, here are the people and the processes you need in place, and provide a

09:43

framework, right? Yeah. I think, like, every small organization goes through a phase where someone opens a Google doc and writes down a runbook for how to run incidents, right? And so what we wanted to do is to provide some of that for you in a way that didn't get in

10:00

your way. So we've got a bot in Slack, right, that you can use to declare incidents, assign stakeholders, set stages, communicate status, all that sort of stuff, so that you don't really have to go in and kind of trial and error that Google doc and try to get folks enrolled in the process. There's just a thing kind of nudging you along the way and

10:20

helping to offload some of that cognitive burden when you're in the middle of managing an incident, right? Typically, as an incident commander, you're thinking about a lot of things. Sometimes you're also trying to mitigate the incident. Right? If it's two a.m., you may have a stretch of time where you're doing

10:35

everything on your own. And so I think the more folks can find mechanisms and processes that help them reduce the number of things they're doing during management so they can focus on getting the right folks in the room and finding the means to mitigation, the more successful the response process becomes, which results in better data for your post incident analysis, and then your, you know,

11:03

cross-incident learning over time. Yeah, it's one of those things that, like, we've all done incident response wrong enough times that we kind of know. So I think it's one of those things like, you know, in software engineering, writing logs has been done for decades now, so you don't write your own logging engine. You just pull in a logging

11:30

library because you don't need to reinvent that wheel. And I think incident response is one of those: we don't need to reinvent this wheel. We can just buy

11:37

a wheel that's already built. Yeah, we actually have a couple of customers of Jeli who are trying to replace their wheels, right, because, you know, some of the large organizations who started this process ten years ago had to make their own. I used to work at New Relic and we had a Slack bot we called nerd bot, which was our incident response, you know, facilitation tool. But there's a cost associated with those things, right?

12:03

You have to maintain them over time. Oftentimes they kind of fall to the bottom of the priority stack, and so iterating on your internal process becomes really hard. And I think that's where, if you go with something, you know, like Jeli's incident response bot, which is, you know, fairly opinionated but narrow in scope, right, it's just: here is the set of criteria

12:22

that we use for this thing. With some customizable features like automation, then you don't have to kind of invent that wheel and then reinvent it iteratively for all time. And you also don't really have to, you know, answer a lot of those questions when your incidents become more complex. There's like different phases of your incident response process. When you're a five person team, you

12:46

jump in a Zoom call, right, and you fix it. When you're fifty people in a major incident room, it's a very different experience and requires a different set of skills and supporting tooling. So, yeah. Cool. You mentioned a couple of times the post incident response plan, so elaborate on that a little bit for me. Yeah, this is another area where I think everyone kind of starts with a recognition that there's more that can be gleaned

13:16

from these experiences. Right? Early on, you have an incident, you respond to it, you fix it, maybe you shoot an email off to folks saying what happened, and, you know, here's what we're going to do to

13:24

address it in the future. But as your system complexity grows and as your organization grows, there are, you know, many more opportunities to figure out how to change not just the system itself, right, to, you know, write better logs or increase visibility into the system's behavior, but also to change how the organization is

13:46

structured around those systems. Right? So, one anecdote I like to share is from my time at a previous company. We had this custom feature flag system that had been around for, I don't know, it was like eight or nine years or something. Everybody wanted to get off of it. It wasn't great, and every time there was an incident with that system, someone from the network engineering team would be pulled in because they were one of the original

14:11

authors. They had nothing to do with this system anymore, but no one else knew how it worked. And so if you're just responding to and mitigating incidents and not looking any further, you don't see those types of organizational misalignment, right, where you've got a primary owner or subject matter expert that is, you know, accountable for a whole slew of things that have nothing to do with

14:31

this foundational service that's critical for business function. If you've got a feature flag system in, you know, a fourteen year old code base, it's got to work. So I think when we talk about post incident learning, this is the next phase in maturity. Right? You've figured out your response process, you know how to get the right folks in the room, you know how to move toward mitigation, and you're starting to capture some of the, you

14:58

know, follow ups that you want to take. Maybe we need more observability. Maybe this library in our service is out of date, and if we update it we'll get better performance, things like that. But it goes beyond some of those follow ups, and as you start to cultivate a process around this, and

15:13

there's a lot of different ways that folks do this. You know, they're referred to as postmortems or learning reviews, or, you know, sometimes you're just getting in a room and talking about the incident without the structure. You start to uncover all of these really interesting aspects of not only the responding team, but the organization overall. And so some of the things that we're most interested in learning is, you know, what did folks know when they responded

15:39

to the incident and what did they not know? Right? What are the ways in which the folks involved communicated successfully and maybe not so much? How did the organization's processes contribute to or prevent aspects of a specific incident. It's all kinds of interesting stuff to dig into, and you can look at it

16:00

from a bunch of different angles. So we have, you know, a lot of examples of our customers creating multiple investigations on an incident, where person A and person B both investigate and then you see, like, where the differences are, and I think that turns up a lot of interesting stuff. We've taken the approach in Jeli of writing incident narratives, so, you know, post learning

16:23

review, post mortem, whatever you want to call it. Our feeling is that incidents are stories and the way that people connect with information and learn is through storytelling. And so we've taken the approach that, you know, we want to provide folks with a tool to tell a story backed by evidence, right, what was actually said during the incident, what you know, metrics or data we were looking at, but to kind of nudge folks in

16:48

the direction of sharing their perspective and their assertions about what it means. Right? When these two folks were talking, they were talking about different aspects of the system, and they didn't realize it. What does that mean, right? What's the opportunity there to improve the incident management and the way that these teams are connected and communicating, those sorts of things. Yeah, you see that a lot whenever you have people with different disciplines or different backgrounds, you know,

17:18

a networking background versus a software engineering background. And I think that highlights one of the arts of post incident response, which is creating those follow up items and getting the right people engaged to recognize, prioritize, and address the things that you learned from that incident. Yeah, and you know that

17:45

you mentioned, like, different disciplines. There are different disciplines within the responding team, but incidents also provide this really unique opportunity to consider the different disciplines across an organization. Right? So for your major incidents, it's not just your, you know, senior engineers from a specific team. It's also your customer support folks on critical accounts. It's also your group leads and your

18:08

executives. All of these people have different priorities and perspectives and understanding with regard to the impacted systems and the impact on the business. Right? If I'm responding to an incident, my goal is to make the chart go down, whereas my executive or salespeople's goal is to minimize the costs associated with customer impact. Right? We've got SLAs with our customers for uptime, and we need

18:33

to keep that in line. And I think the different perspectives and priorities there result in that same kind of differing perspective that I mentioned earlier, where I may look at an incident and think it means one thing, but my group lead or, you know, my sales associate may look at it and think another thing. And the opportunity with, you know, incident narratives or post incident learning is to try and bridge that divide between those different perspectives and help everyone cultivate

19:03

a shared understanding of what it means across those dimensions. Right, this is what this incident meant for business impact and process, for customer satisfaction, and for the you know, sustainability of our you know, critical services something like

19:18

that. Yeah. I've even worked in organizations where it involved the marketing team because they were out scrolling Twitter, you know, catching tweets about the incident and responding to those and trying to minimize the blast radius there. Yeah, this is a whole other aspect that's really interesting, which is, like, where do incidents come from? Right? Who says what an incident is? We've taken the approach that anyone can declare an incident. Some organizations

19:48

we've worked with are very narrow in terms of who can declare them. But yeah, customer success, marketing, you know, a random person from the internet. These are all sources of potential incidents, along with, you know, automation and observability, those sorts of things. And so, you know, once you start thinking about this space and you start exploring ways of benefiting from these lenses on the current state of systems and organizational process, you start to see, like, there

20:18

are opportunities everywhere. Right? At Jeli internally, we create incidents for things that are not incidents. If we have a release going out that we think might be, you know, impactful to customers because it changes some aspect of the user

20:33

experience, that's an incident. If we're trying to better understand database failover in RDS, for example, we run a game day as an incident, and doing that gives you this repository of information that you can use again to build that narrative and make those assertions about where are we and where do we want to be with regard to how we're operating and the health and stability of our

20:56

systems. So that's a really interesting anecdote about marketing. I love when those things come in from places you don't expect, right? You just kind of get a message from someone that you haven't met before and they're like, hey, there's something going on, we'd better declare. Yeah, yeah, you see someone from marketing enter one of the tech Slack channels and you know this is

21:17

not going to go well. So I think one of the cool types of companies I like to work with fits the model of Jeli, because you actually use your own product, you know, like when you build and release it, your team actually uses it to manage your own incidents. And I think that is really, really cool because you get firsthand experience of what it's like to be your own customer, and you can understand what your customers

21:49

are actually seeing when they're trying to use your tool. Yeah, one thing that was really interesting in the early days about working with our customers, and it's interesting now as well, we'll have to talk about PagerDuty at some point later. But one thing that was really interesting is that the customers that we work with are really passionate about their process and those opportunities to learn, and so we get to work really closely with them on, you know, understanding

22:18

their process and building tooling that works for them. We work with F5 and Indeed and Honeycomb and Zendesk. These are, like, you know, large influential organizations who are kind of at the cutting edge of this process. So there's this bidirectional information share where, you know, we can build features that support those organizations' processes, but then we can also adopt some of those organizational processes because they make a lot of sense and they work well for us.

22:45

We were doing a product demo for an important group of people the other day and we noticed some lag in one of our features, and I actually declared an incident with Jeli about the performance of Jeli, and we ran that in parallel during the demo, and there was this moment where I was just like, this is so cool, running an incident with the tool that we're demoing to people. And there wasn't actually an

23:11

issue. It was a Wi-Fi lag, you know, thing. So everything was good and that's okay. That's also a learning opportunity. But yeah, it's been really exciting to kind of watch things evolve over time and be a, you know, beneficiary of that system as well as trying to evolve it for our customers and find that alignment across orgs, which is really unique. Most

23:36

of your incident response and post incident learning is within an org. We've had the unique opportunity to kind of extend that outward. So, fun. Right on. One of the things I'm interested to get your opinion on is, over the years, I've developed the opinion that there's a difference between mitigating the issue and

23:59

resolving the issue. And I refer to that in terms of, like, during the incident, you know, let's say your API service is slow. It's okay during the incident to throw more servers at it. You know, we're going to mitigate the issue by adding more servers or adding more memory, or do something to make the symptoms of

24:23

the problem go away. But then there's this, like, defining moment of, okay, customer impact has been resolved, but now we've got to go back and find the root cause, because adding the additional servers did not fix the issue; it fixed the symptoms. And I'm interested to get your opinion on that. Yeah, it's a really good distinction that you're making there, and I think it has

24:48

a lot to do with prioritization and understanding. Right? So oftentimes, especially in major incidents, there's a priority involved there to minimize customer impact, right, because customer impact means lost revenue. Incidents are expensive both in terms of time and, you know, customer satisfaction and trust. And so I think there are kind of two ways in my experience that you mitigate before resolution. And the first, which I'm mentioning now, is about minimizing the impact in favor of kind of

25:18

getting things back on track. And so, like you said, throw some additional servers at the API and that'll address the symptom, but we still don't understand what's going on under the hood, right? And so I think the

25:30

second is that sometimes you can choose not to mitigate an issue. I've been in situations where we've had customer impact, but the priority of understanding what's going on has exceeded the priority of needing to address that impact, maybe because it's, like, you know, one user at a customer rather than all customers in a major incident. And so that second bit I think is really interesting because you can use the incident and the levers you can pull during the incident

26:00

to create the conditions for learning while it's happening. Right, So if you mitigate the incident with the API, it means that you have an opportunity to explore what was actually going on. Maybe you isolate one of those servers and

26:10

you start to dig into you know which function calls. If you've got distributed tracing, which is amazing, you know which specific function or endpoint is causing delay in the response, right, that's causing a delay across all responses, And you can kind of take advantage of that system state, which you know, if you reboot the servers, if you add a ton of them, those conditions go away and you lose your opportunity to understand what's going on.

26:37

And so there's a lot of different ways to look at it. I think mitigation and resolution, for folks outside of incident response, that's a mental framework for understanding: are we good now, and are we good for the long term? Right? But as a responder, those two events are really key in terms of communicating within the response group what our level of understanding is and what priority decisions we're making with regard to customer impact or, you know, system stability or what have

27:07

you. Sometimes incidents are not resolved for days after, you know, the actual incident, especially for large, complex incidents. Sometimes you just have to get things to a steady state and let them stay there until you have a chance to enroll more folks or get a deeper understanding of what's going on. And sometimes those fixes are not things you can roll out, you know, as one hot fix. Sometimes they are major upgrades or major changes to kind of

27:33

foundational business logic. So I'm glad you made that distinction because it's really important, and I think oftentimes folks outside of the incident are just like, when are we mitigated? When are we mitigated? But you can't lose sight of that time frame between mitigation and resolution, because that's where

27:51

a lot of the, you know, exploratory understanding comes out. For sure. And one of the things that I try to insist on is that when mitigating the issue, we're allowed to make live changes in production, but the actual root cause fix has to go through our normal development cycle of making the changes in dev, pushing the changes to a staging environment, validating them, and then promoting those changes to production. So it has to follow that flow. Yeah,

28:27

and that goes back to that prioritization opportunity. Right? So once you've kind of addressed the business impacting issue, then you've got to get back to your fundamentals, right, and your business processes and compliance and all of that.

28:41

And so detangling those two things allows you to respond in a way that helps the business, and then address the issue in a way that helps the business, and do those in different ways, because especially when you're further along in your maturity model, when you're a large organization, there are a lot of things that can stand in the way of quickly addressing an issue. Right? If you don't create a path for doing that, then incidents

29:07

end up taking longer and having a lot more impact. So yeah, and the other thing we've learned in all of this is that every organization is different, Right. Some organizations have response processes that specifically call out different ways of

29:22

mitigating impacting issues and different ways of capturing follow ups for those. Right, Sometimes the incident's not closed until you've resolved it, and sometimes it's closed at the point that it's mitigated and you've captured the follow ups you want to take action on. You know. As a result, sometimes folks keep talking about the incident after it's been closed and they want all of that for their post incident learning review as well. There's just so many different ways to tailor this

29:51

whole incident management process to help an organization be more successful. Yeah, one of the places I worked years ago was at a healthcare provider, and we provided medical services for hospitals across the US for trauma patients, and so for every incident that we had, whenever we broke out an incident room, we actually had a person from our quality team who would join the call as well and let us know, like every five or ten minutes, how

30:22

many patients across the United States couldn't receive life saving healthcare because our stuff was broken. And so we had a very unique incident response model there that doesn't really apply anywhere I've been since then, but there were still lessons that I've taken away from that. You know, number one is mitigate the issue as quickly as possible. Right. I'm so interested to hear, how did that information help

30:48

or hinder mitigation for your teams? It really set the priority and kept us focused, you know, because as that number went up, you started to understand, you know, the impact that this was having. And this was not a case of "the other development team sucks" or "their network is terrible." In many of our incidents, it was because of user error at one of the trauma

31:22

centers. But it's still not okay to say, oh, well, they're just doing it wrong, because you have to realize at the same time, you know, while you're on the phone with that person, they're up on a table in the emergency room doing chest compressions on this patient. So they're going to give it their best shot, but they may not be the most attentive user at that time, and you've just got to work with that.

31:47

Yeah, you're highlighting, like, a perfect example of, I think, why we are so focused on post incident learning, and it's because the most important aspect of these complex technical systems that we're all building and maintaining is the people involved. Right? And when you're in an incident response room, a major incident room, whatever, and you've got someone reminding you of the impact, especially when that impact is, you know, not just on dollars, but also on people's

32:15

lives, you create the conditions for this, like, profound human creativity, right, in terms of figuring out, you know, what can we do as a team to kind of, we're back to the incident management space, what can we do as a team to kind of come up with a creative solution here and get

32:35

us back to good, you know, temporarily. And I think if you're not reflecting on and talking about those moments in incident response in your, you know, post-incident learning review or narrative review, whatever you call it, then you're missing out on all of those examples of the ways in which the people are helping support the system and keep things moving. You know, we hear a lot in tech and DevOps and elsewhere that, like, automation is the key

33:04

to sustainability and more reliable systems. And there are things that we can automate, you know, especially assigning roles during incident management and response. But there's a lot of, you know, human involvement, tweaking the system and adding, you know, capacity. Not, you know, technical capacity in terms of the number of network requests you can handle, things like that, but adding capacity in terms of the

33:30

system's adaptability. And I would love to be a fly on the wall for one of those incidents that you mentioned, because I imagine folks really came together and came up with some creative solutions to find a way to mitigate those incidents and get things back to good so that they could figure out, you know, what the long-term solutions were. That's such an exciting, like, space. Yeah, for sure, and it's, you know, it

33:55

was, a majority of the role was communication. Like, all of my coworkers there had exceptional technical skills, but their communication skills were just a plus one on top of that. And I think that's what made it work so well. And I still say that to this day. You know, DevOps

34:17

is not a technical world. There's a technical component, but it really is communication: building the technical framework, but then communicating that out to your customers, the engineers that you support, and getting the feedback from them to understand what's the difference between what I built and what they thought I built. Yeah, it's really great when you have those folks who kind of know how to be in a critical situation and maintain, you know, effective communication and

34:46

find a solution to the issue. One thing we talk a lot about is, like, how do you scale that, how do you externalize those skills? Oftentimes we find that the folks who are most effective in incidents don't have the capacity or time to help uplevel or train folks into that

35:04

discipline, right? It kind of requires a lot of different skills. You need technical expertise, you need experience with the systems involved, and you need a good handle on, like, effective communication, not just for communicating the status of the incident, but also communicating with the folks that you are directing if you're in an incident commander role, for example. This is another area where, if you invest in learning from these things, you can create artifacts that folks pick up

35:30

when they join the organization, right? In almost every large org I've been in, there's a Confluence space or Google Drive folder or something full of post-incident reviews. Sometimes I'll just go in and read those, right, and you start to learn, you know, who are the folks who demonstrate an ability to kind of respond to some of the most significant incidents, and what are they doing,

35:53

how are they doing that, right? What skills or actions they've taken that stood out in the learning review should I try and cultivate as a

36:04

responder? And so that can be a really interesting space too: not just learning about the system and what things we can change to improve its performance over time, but how are we leaving breadcrumbs for the new folks coming into the org who are growing into that discipline? Because trial by fire during a major incident can be a really stressful, kind of terrifying experience, and so the more you can kind of give, you know, these

36:30

anecdotal or story-based accounts of how things go in your organization, the more comfortable folks can feel when they step into that role. Yeah, I think it's one of those areas where there's, like, a mentoring path there. And as I have gotten older and been doing this for a while, I've realized that that's a larger part of my job, sharing that context, because you can put out the documentation, but then there's also, like, the unspoken or the unwritten part

37:04

of that. You know, there's the mood, the feel, the context of the situation. And I think that's been a problem for, you know, far beyond my lifetime, and the only way we've been successful at solving it up to this point is just through that mentoring-type role where you bring people in, even though you know that they aren't ready to be the lead in this. You bring them in just so that they can witness it and start

37:32

making notes for themselves. Yeah, and that's where a process or a policy around incident response and incident learning that is based on transparency can be really helpful, right? Sometimes you get a lot of folks joining the major incident room that are trying to contribute in ways that may not actually, you know, help with mitigation. But a lot of times we find in large organizations that have, you know, policies angled toward transparency, folks just join to kind of understand

38:02

and learn in the moment and also after the fact. So, you know, the incident learning review calendar is always a place that I go and try to figure out, you know, which of these incidents are going to be most helpful for me in understanding the way this organization operates and the critical

38:21

systems, right? In a past role we had a Kafka platform that was, you know, involved in a lot of incidents, not because the Kafka platform was a problem, but because everything was built around it, right? So every time there was an issue with any system, that kind of tied back

38:35

to there. And that presents a really interesting lens for, you know, how do these folks communicate with the broader org and what changes are we making to shore up some of those critical dependencies? And, you know, just being able to join a conversation about that, not having been involved in response or having anything to do with the teams involved, can be a really powerful opportunity for you to kind of learn about the team that you're working with and the

39:00

underlying technologies. Especially for folks like me: it's been eight years since I was, you know, maintaining those types of platforms, and so picking up on some of that nuance so that I can support the folks who are around those systems can be really helpful. There's a line there, though. You've got to

39:16

make sure that expectations are clear, right. If you're participating in something for the purpose of learning, you're kind of a sponge rather than someone who's bringing opinions, you know, not having understood the circumstances of the specific incident. So you need a healthy kind of culture and set of expectations around this.

39:37

But I've seen a lot of orgs that do it well, and it is a game changer, you know, for helping to provide, you know, scalable mentorship and opportunities for folks to get a better understanding of the details. Yeah. One of the things you commented on that I think just can't be

39:57

elaborated enough is transparency. And I've worked in multiple places, and when I first started my career, it was in many instances a fireable event if you created an incident, and for that reason, people would try to hide and cover up their incidents, which led to no one learning from that. And these days, you know, I almost parade it around, you know,

40:25

hey, I broke this because there's a learning opportunity there. And I think it's really important to be open and to build the environment where people aren't afraid to say that they made mistakes, and even the dumb mistakes, we all do them, you know, you learn from it. And I actually, at some point in my career a boss of mine told me, and

40:49

it's an anecdotal story, but it's still effective. Someone created an incident that cost several hundred thousand dollars and said, "Oh, am I going to be fired now?" and they responded, "No, I just spent two hundred thousand dollars on your education. Why would I fire you now?" Yeah. And this is where I think, like, it's really difficult to build trust, right? It's really easy to damage trust, it's really difficult to build it.

41:17

And so if you're approaching your incident management, you know, life cycle and process from the perspective of trying to support folks doing what they can to help the business be successful, you get a lot of really impactful contribution and collaboration with regard to, you know, keeping systems healthy and things like that.

41:39

But if you over-index on, you know, the measurable metrics, we're humans, right? Every human will game a measure, right? You start to cultivate some of those types of environments where, you know, what's the consequence of me doing the right thing here? Is it going to reflect poorly on

41:58

me? Is it going to cause an issue? And so, thankfully, I think every organization that Jeli has worked with over the two and a half plus years since I joined has taken the approach that, yeah, these are blame-aware learning reviews.

42:14

Right? We know that folks make mistakes, that they don't have sufficient context in the moment, and that they can learn from those experiences and change their approach next time, versus this kind of, you know, older model, we'll say, of prioritizing the, you know, public visibility of how things are going, and maybe, like, maybe we don't declare an incident for that one,

42:39

we just try to fix it quickly. Early in my career, I was learning how to use Microsoft SQL databases and we had a large SharePoint site. It was another medical audit company, and I learned what drop database commands do, and I dropped the entire production database, and fortunately I had enough experience to quickly restore it before anyone noticed. But that was an environment where I didn't feel comfortable, you know, broadcasting that I had just, in the

43:10

process of learning some new commands, dropped the entire database. So yeah, it can be a tricky balance, but, you know, sunlight is the best medicine, right? Transparency in these types of environments allows folks to do what's necessary to get things back to good. And I think the more you can kind of socialize and demonstrate that transparency, the more effective your organization is going to be, and the more folks are going to want to contribute to that mission,

43:39

whatever it is. Yeah, yeah, absolutely agreed. So let's talk a little bit about what's going on with Jeli these days. Yeah, so Jeli has been, like, the most interesting experience of my career. I think I mentioned I joined in twenty twenty one, I think it was. Jeli was just

44:02

a post incident analysis tool at that time. So we had this notion of building narratives and not much else, and we recognized that part of the post incident learning process involves having good data, and the way that you get good data is you get consistent in your process. And so we ended up building this incident response bot and we also went to the other end of the spectrum

44:24

and started introducing features for cross incident analysis. And so this is, you know, after an incident, let's spend some time learning, but then how do we roll up those learnings into themes across incidents that help the organization make decisions around growing teams to support services or changing direction with regard to build versus buy those sorts of things. And so we've been working on a lot of

44:51

cool stuff for the last two and a half years. And then in, what was it, I think November second, the public announcement that we were merging with PagerDuty went out, which has been, like, really exciting and also a trying experience. It's been a month, right? And so PagerDuty is something like eleven hundred employees as of January of this year; we were twenty one. We're kind of in the process of figuring out how to bridge those two divides.

45:25

And one thing that I'm really excited about is, you know, Jeli has spent a lot of time differentiating itself as a product in the post-incident learning area, and I think we've brought a lot of kind of novel approaches and opinions to incident response in general. PagerDuty has been doing this for fourteen plus years, right? And they created the category within which Jeli could

45:51

become a company, which is pretty cool. And so what we're looking to do now is to take that practice, you know, post-incident learning, and really get folks from the earlier phases of the maturity level, where they're just doing incident response and maybe they're doing a post-incident learning review in a Google Doc, and bring them into the modern era, right, and start creating incident narratives and doing learning reviews. PagerDuty has something like twenty seven thousand free and paid customers.

46:21

There's a huge opportunity there to help folks understand a better way of kind of benefiting from it all. So that's my focus right now: figuring out how do we bring those two worlds together while keeping an eye on preserving that kind of post-incident learning tooling and opportunity. But yeah, a lot of exciting stuff on the horizon. We are going into a new year, so I think things will look very different on the PagerDuty

46:53

side and probably also improve on the Jeli side as well. It's going to

47:00

be really interesting. Yeah. I think it's a natural fit, you know, because PagerDuty is hands down a great tool for notifying people that there's something requesting their attention, but what you do after that is kind of up to you, and so it seems like a natural fit to just roll that right into Jeli and help people, like, just from a business perspective, take this huge PagerDuty customer base and

47:37

just guide them into the thing that they thought they were doing all along. Yeah, one focus for us has always been, you know, how can we improve the quality of our customers' post-incident learning reviews, and how

47:52

can we allow the folks conducting those investigations to focus on what matters? We've talked to organizations where, you know, there was one problem manager at a company that used Microsoft Teams, and part of their job was to go through every Teams channel and find transcripts associated with an incident and put them in ServiceNow. Nobody should be doing that, right? That's just toil.

48:16

That's not productive. And so one thing I'm especially excited for with this partnership with PagerDuty, or this acquisition by PagerDuty, is they've got a ton of data, right? And so when you're building your post-incident narratives, your timeline, and you're adding evidence and you're trying to help folks understand the details of an incident, the more data you have to substantiate those claims and those events that you're highlighting in the incident, the more

48:44

folks can learn from, you know, not only the overall shape of the incident, but the systems involved and how they're used, to understand, you know, the underlying technology. And so there's an element there that's really exciting, which is just that we have a lot more data for our users to work with. But I also think, like I said, PagerDuty has been known for a really long time as kind of an industry leader in scheduling and alerting, right? Ack and bail: I got paged, I'm gonna

49:15

go fix it. There is a better way, right? Like, there are ways to tie that process into the incident response process and the post-incident review, and I think that's going to be our focus over the next, you know, several months: figuring out how do we give PagerDuty more mechanisms for supporting responders throughout the entire incident management life cycle, not just the detection phase, which a lot of folks know and they're familiar

49:45

with, but, you know, PagerDuty's full Operations Cloud, which most folks I've talked to don't even know exists. And this is, you know, the AI automation for reducing noise to signal when it comes to events. This is all of the mechanisms around running actual incidents, and then this is the post-incident as well. PagerDuty has a feature today called Postmortems, which is fairly straightforward. It's your Google postmortem doc.

50:15

But we think there's a lot of opportunity to not require that folks are going and creating these data sets manually, but just kind of provide that information so they can use it to build better narratives that are linking all the things. And

50:30

yeah, I think it's a natural fit too. I mean, I've been using PagerDuty for basically my entire career, right? And being able to bridge that gap between that paging and scheduling and then the things that I need to do to help my team be successful, it's going to be huge, you know, from my experience. Yeah, I think having access to that data is going to lead to better collaboration after the fact,

50:57

because, for me, I've always struggled with that. You know, after the incident's over and you're trying to do the review of it, trying to remember what things happened in what order and remember all of those steps that you took. So if you've got something that can prompt you with reminders and kind of pre-populate that narrative for you, I think it's just going to

51:17

lead to better results at the end. Yeah, there is nothing better than having a starting point when you are trying to investigate an incident, right? If you open an empty Google Doc, it's a hard time. But if you can start with, you know, in Jeli today, you start with the incident transcript, all the conversation that happened in Slack, and data about

51:39

who is involved. So much better than starting from nothing. And that's especially true when, you know, your incident response process uses multiple data sources, like multiple incident response channels or your Datadog charts or what have you. So we're not really looking to do the post-incident narrative for you. We're looking to give you a point to start from, because that saves time, it saves energy, and lets you focus on the things that only you can create within your post-

52:12

incident narrative, right? The investigator is a conduit through which the folks who are involved and the organizational miscellany kind of come together into a coherent story about what happened and what it means. So we really want to, like, provide a foundation on top of which folks can have these conversations. And I think there's a lot of opportunity there with this kind of broader spectrum of data and integrations within customers' existing processes. Yeah, it reminds me a lot of, like, a

52:47

there's, like, a people skill there. You put two people who don't know each other in a room and anything could possibly happen. They could strike up a conversation, they could sit there in silence. You know, there's just no way to gauge it. But then if you give them a conversation starter, then you can sort of, like, guide the results from there. And I think that's the real value of what the post-incident narrative does,

53:19

is it's that conversation starter. Yeah, I mean, certainly for us, as you mentioned, like, we use Jeli internally and we do our own learning reviews. I think the exercise of, you know, mitigating the incident, putting together the learning review, those are valuable experiences for the folks involved. But getting everyone in the company, because we can do that at twenty one people, into a room to talk about what happened, to ask questions, to figure out

53:45

what did you know? What did you not know? You know, what did I know, and I wasn't involved? Those sorts of things. That's where you get really interesting kind of exponential increases in understanding. And the thing that most excites me about these learning reviews is it's not just the understanding of the technical or the organizational process. It's the understanding of each other.

54:08

Right? How I communicate in these environments, how you communicate, what your expectations are, what sorts of things I need to be better about informing during response. It's a retro, right? And the software can't operate itself. And so if the people are working effectively together, then the software

54:28

is working effectively, and if they're not, then it's not. And I think that's one of the really big opportunities, especially for, you know, the large complex organizations in novel economic environments, to figure out, you know, how do we improve our efficiency in our collaboration so that we can do what needs to be done. Really exciting. Oh, it is really exciting.

54:53

I'm looking forward to seeing how this plays out for y'all. Yeah, I'll have to let you know. There's, you know, we're in a phase right now where there are too many good things for us to do, so we've got to figure out the next best thing and focus on that. But yeah, that's the spot I want to be in. Endless opportunity ahead of us. We've just got to figure out how we're going to get that to our

55:20

customers as quickly as possible. Yeah, for sure. Yeah, cool. So, anything else you'd like to share with us about incident response, Jeli, PagerDuty, any topics at all? Yeah, if you're not already using PagerDuty, take a look. It's the best thing for paging that I've ever found. And if you want to give Jeli a try, there's a free trial on the site and we start you off with some pre-

55:45

built learning reviews so you can see what they look like. Start playing around in there, and if you have any questions, you know, I'm sure you'll be able to find me in the show notes here. But it's been really great to meet you, Will, and thank you so much for the opportunity to chat. No, thank you, it's been a great conversation. I've enjoyed it. And, uh, if you're up for it, I would love to have you back on the show. That'd be great. All right, thanks so much.

56:13

All right, cool. Well, thanks for listening, everyone, and I will see y'all next week.

Transcript source: Provided by creator in RSS feed
