I think if there's one thing to learn from that, it's: if you suspect something, disconnect, do not shut down. Welcome to DevOps Sauna season 4, the podcast where technology meets culture and security is the bridge that connects them. Welcome back to the DevOps Sauna. I'm here with Darren again, hi Darren. Hey Mark, it's always nice to have a conversation, especially in your area of expertise about security. Yeah, I always feel a little bit more at home discussing security than I do the DevOps topics. So let's see what we can get up to today. Yeah, I heard that you were one of the, what, three people that could get tickets to Disobey. Yeah, the tickets for Disobey sold out extremely quickly this year. I think it was two minutes and then they were gone, but I was lucky enough to get some of them. So it was an extremely cool event. Pretty high demand. How many people were there?
I think 1,800 attendees in total over the two days. That's like everybody's just sitting there refreshing, waiting for the tickets to open, and then boom, it's all gone. Mm-hmm, almost instantly. There were a lot of cool talks, especially this one talk given by the police, which was entertaining, if slightly intimidating, to have three people in police uniforms stood on stage. Help the police, please, ladies and gentlemen, and those who've not yet decided. So, any big topics there? Is there something that we could talk about today that came up at Disobey? It's actually a topic that I think was kind of missing from Disobey, because there were a lot of technical things, but one topic I think everyone needs to be thinking about right now is incident response.
And there was actually very little about it, but then again, maybe I was just busy with the capture the flag during the event while the speakers were talking about that particular subject,
maybe you and I can handle it here. I think we can. I like this topic a great deal for many, many reasons, but I also kind of understand it's not the sexiest topic in the room for a security conference. But I think this is one of the areas where companies can do an awful lot and get a lot of bang for the buck just by taking appropriate actions to have a proper incident response policy. What is incident response?
I mean, it's exactly what it says on the tin. It's responding to security events as they happen. It's usually broken down into phases. You have the discovery phase, which is the moment where you realize something's wrong; you say, okay, this is a security incident. There's an analysis phase where you analyze what's happening. There's an action phase where you actually implement the fixes, whether they're temporary or permanent. And then a post-mortem phase: recording, analysis, that kind of thing. And all of these four things together make incident response. I think that the neat thing about this is you can do this whether you have
tools in place or not, can't you? Yeah, I'd say this comes mostly down to a policy and mindset approach, because if you asked me about good incident response tools, I'd have to say there probably aren't any, because you have to be able to respond to any kind of incident, and if you have a tool for it, it's going to be focused on a specific area. It's more about how people are trained, how people go through things like annual security trainings, any kind of simulations, any kind of testing you have. And you kind of build up this resilience to bad situations happening: how to not make them worse, how to prevent the spread, how to mitigate them and limit them as much as possible. So I talk a lot in my work and mentoring work about the art of practice.
And one of the things that you highlighted for me when we've been talking about incident response is practicing: going through drills, fire-drill type of things, going through how you respond to various incidents. And simulations was a word that you had used. And I think it's certainly not about the tools, and it's not only about the policies, but it's also about practicing and exercising how you will handle different types of incident response, and anticipating incidents that may come in from a place you weren't expecting at all. Like, I guess COVID was one of those. Precisely, yes. And the importance of drilling is one of the things that's so understated.
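The four phases described earlier (discovery, analysis, action, post-mortem) can be sketched as a tiny state machine that enforces their order. This is a minimal illustration under my own naming; the class and phase names are invented for the sketch and are not from any real incident response tool:

```python
from enum import Enum

class Phase(Enum):
    """The four incident response phases, in order."""
    DISCOVERY = 1    # realize something is wrong
    ANALYSIS = 2     # work out what is actually happening
    ACTION = 3       # implement temporary or permanent fixes
    POST_MORTEM = 4  # blameless write-up afterwards

class Incident:
    """Tracks a single incident through the phases, recording each step."""

    def __init__(self, title: str):
        self.title = title
        self.phase = Phase.DISCOVERY
        self.log = [f"DISCOVERY: {title}"]

    def advance(self, note: str) -> Phase:
        """Move to the next phase; refuses to advance past the post-mortem."""
        if self.phase is Phase.POST_MORTEM:
            raise ValueError("incident already closed")
        self.phase = Phase(self.phase.value + 1)
        self.log.append(f"{self.phase.name}: {note}")
        return self.phase
```

The point of the sketch is the ordering: you cannot jump to the action phase without passing through analysis, which mirrors the discipline a drilled plan is meant to instill.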
So if you have a perfect incident response plan that you've never practiced, it's essentially useless. It's just information on a bit of paper. And when you start having an incident, if that's the first time you sit down and read your incident response plan, that is just time lost as you try to understand what is written there. So you need to drill. Not only do you need to drill the people, but you need to drill the plan itself: you need to identify which parts of the plan make sense, which parts of it are valid, and which parts of it don't make sense and should be removed or refined. So you end up with this kind of circular set of testing where you generate a test and you test in both directions. You test people's ability and readiness to handle incidents, and you test the plan's ability to support people in handling those incidents. And then you run through that a handful of times and you get this thing that should work, but you can't know if it does work until something actually happens and you have to do one of these things for real. This is not a drill. So what kinds of things then go into an incident response
plan? Maybe we should start with the identification phase. How do we identify different types of incidents? And maybe, what are the different types of incidents? Okay, so an incident response plan is not going to have categorizations of incidents, because an incident response plan is something you actually pull up when you know you have an incident. So it's there for specific incidents, but an incident can be anything. For example, I do like to talk about an actual Eficode security incident: someone lost their phone by dropping it in the Baltic Sea. And that's a security incident. It's not one we can do anything about. We're not going to go and get scuba gear and start diving for their phone. We just have to assume the tides will take care of it, remote lock it as much as we can, and move on. But these kinds of things happen. I'm wondering if this is going to backfire at some point and someone's now going to go diving for an Eficode phone in the Baltic Sea. If that happens, I'm going to be in a lot of trouble, but I think we'll be safe. So that's an incident, but it wouldn't necessarily trigger the incident response
plan. Incident response plans are for active incidents above a certain threshold. So what we would think of would be things like active attacks, malware being found on the systems, or traffic from suspicious IP addresses coming in within a rapid timeframe. So we have a situation where the incident response plan needs to be as flexible as the types of incidents, because as soon as we narrow it down to "here are the incidents we might face", as soon as we face an incident outside that list, the question is, well, does this trigger the plan? Does this become a problem? So we have to make it general. And that's how we start. We have categorizations of incidents in various places in the documentation, and then an incident response plan saying: if something is considered by the security team to be a large enough incident, then the plan is triggered. And having that flexibility and knowledge in the security team is what starts the plan. Good. And then we don't need to go too deeply
in, but there are also all types of human incidents that require a response. For instance, I saw somebody come in through the door that I had opened with my tag, just as my elevator door was closing. And by the time I got back down, I couldn't find them. So that is a security incident: do we find out who that person is, and does it get followed up? This is an old one. And what I learned from this is that many security officers are quite eager for you to over-report incidents and would rather you file incidents than not. So in a case like that, I filed it with our security officer. They cleared the situation fairly quickly. They found that a key had been used within essentially moments of mine having been used, and that it seemed to be a legitimate follow-in in that case.
But you know, these types of things can come up. Yeah, definitely. So that's a simple type of analysis that I just described, but let's talk about some perhaps more technical ones. What other types of analysis are there in phase two of an incident response? What we will typically see is, if you are ISO compliant, you will have a centralized logging system. If you are compliant with ISO 27001, the 2022 version, you will have some kind of threat analysis happening against that logging. So you will have something, maybe it's AI, maybe it's just standard algorithms, analyzing traffic and deciding whether it is suspicious. This is something that security teams have been doing now for 20 or more years. We're looking basically for strange metrics. That's all that triggers our suspicion most of the time: metrics like seeing odd CPU usage where we don't expect it, or perhaps processes running under users we don't expect to be running those processes. We might track recent changes to files. For example, in 2007, when PHP was still young, you could form these quite sophisticated attacks by injecting obfuscated code into the first line of PHP files, adding a load of spaces and then putting your code after them. So when you opened the file in a text editor that didn't wrap long lines, the line would seem empty; you'd just have a marker at the right side showing that the line continued. So what we're looking for in this analysis phase, or the discovery phase, let's say discovery and analysis kind of merge together here, is anything out of the ordinary. And then
after we know that there's something out of the ordinary, we want to isolate it. That means pulling it off the network, not switching it off; that's extremely important. And I think if there's one thing to learn from this, it's: if you suspect something, disconnect, do not shut down. Because if you shut down, you can actually lose things that are loaded into memory, you can lose temporary files, which might be important. So keep in mind: disconnecting to prevent any lateral movement, but not shutting down, is ideal. Nice. So we do some analysis in order to identify incidents, and then we have another phase called analysis where we try to figure out what to do about it. Yep, yep, that's true. And this is actually one of the more boring
phases, because there is this mentality inside of security, I don't know if you've run into it, that says there is no surefire way to make sure you've cleared up everything. And that's why we live in an era of backups. It's why backups are so vital. The correct thing to do if you have a security incident is nuke everything and restore it from backups. That's the safe play. I'm not sure exactly how many of our customers and listeners are going to like that as a solution. But can you open it up a little bit? Maybe if you can tell me why you don't think they'd like it. Yeah. Loss of business is our concern. And of course, I understand if we are compromised. But we haven't discussed things like attack surface yet. So there was an interesting case that came up during the last week in a public NuGet repository. There was a vulnerability that was identified and then it was retracted. And what happened is that in some places things were failing. And when it was retracted, people were like, hey, we're not even necessarily a delivering-to-production type of system. We are an in-house system that didn't have literally any attack surface in this area.
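Stepping back to the log-analysis discussion a moment ago: flagging "traffic from suspicious IP addresses coming in within a rapid timeframe" often reduces to a threshold over a time window. Here is a hedged sketch of such a sliding-window counter; the thresholds and example addresses are invented for illustration, not taken from any real SIEM:

```python
from collections import defaultdict, deque

def flag_rapid_ips(events, window=60.0, threshold=100):
    """Flag source IPs that exceed `threshold` requests inside any
    `window`-second span. `events` is an iterable of (timestamp, ip)
    pairs, with each IP's timestamps arriving in ascending order."""
    recent = defaultdict(deque)  # ip -> timestamps still inside the window
    flagged = set()
    for ts, ip in events:
        q = recent[ip]
        q.append(ts)
        # drop timestamps that have fallen out of the window
        while ts - q[0] > window:
            q.popleft()
        if len(q) > threshold:
            flagged.add(ip)
    return flagged
```

A real threat-analysis pipeline does far more (enrichment, reputation feeds, anomaly models), but the "strange metrics" alerting described above is frequently exactly this kind of count-over-a-window at its core.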
So why would we want our in-house systems to stop for a little while just because there may have been a vulnerability identified? We didn't even analyze whether it had a valid attack surface, which in this case it very likely would not have, because it was all inside a firewall. And it wasn't a malware thing. It was more like just a potential vulnerability in a package that was then retracted. But you raise a great point there, because, yes, what you're actually talking about there is not a security incident. In security, we have two types of events. We have security notifications, which are a vendor saying a package is in some way potentially vulnerable. And then we have an incident, which is an instance of a vulnerability or something else actually appearing. And security notifications that are just "hey, patch this, patch that, upgrade GitLab, what have you", these are not security incidents. And this phrasing is important, because if you treat these as security incidents, you will have a security incident every other day at your company, depending on how large your tech footprint is. This isn't a security incident. It is an event. It is a notification. And all that needs to happen is patching. It's that simple. You just need to patch. However, an incident is a specific instance of an exploit, a specific instance of malware, some actual specific event that happens, not the possibility of an event. And it's important to draw that line, because incident response only deals with the specifics, not the possibility. It requires an actuality. Very important definition. Thank you. I knew this, but I hadn't exactly put two and two together: until we have actually identified that there's a malicious actor, it's not an
incident. Precisely. And that's when the response of nuke everything and reinstall makes sense: when there is a specific threat actor, when there is a specific instance. When it comes to notifications, obviously, at Eficode, on the ROOT side, every time we get a security notification, we don't blow up the platform and rebuild it; that would be a considerable waste of time. Yes, I understand. Good, good. There's been kind of a change over the years, as I understand it, where there's been perhaps more emphasis on incident response, and I don't want to say that people aren't paying as much attention to actually keeping the system secure, but can you even have a secure system, really? I don't think so, not while you want it to be online and functional. Right. You can
have a system that is as close to secure as you can get it. But let me ask you, how familiar are you with defense in depth? I have some familiarity with certain areas. Okay. So defense in depth, if I give a 30-second rundown, is this idea of breaking security into different layers until you end up with this image that's kind of like an onion: you have all these layers of security, and the outer one is your outermost protection. If you're talking about an office, that might be the building, it might be walls and fences, going all the way down through network security, endpoint security, so your devices and their antivirus, right to the core of protected assets, things like your code base and your databases, and the different layers of security that apply to each of them. And this is the switchover that I think you're describing: it went from prevention, which was kind of perimeter-based security, where you keep the outside safe and assume that anyone on the inside is supposed to be there, to a lower-trust model, not zero trust, but a lower-trust model of protection at every level. And that change is going on now, I believe: least privilege and things like this.
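The onion model just described can be illustrated as a chain of independent checks, where any single layer can veto a request. The layer names and the request fields here are purely hypothetical, chosen only to make the structure visible:

```python
# Each layer is a (name, check) pair; a request must pass every layer.
# Breaching the perimeter alone is not enough: inner layers still apply.
LAYERS = [
    ("perimeter", lambda r: r["origin"] in {"office", "vpn"}),
    ("network segment", lambda r: r["segment"] == r["target_segment"]),
    ("least privilege", lambda r: r["resource"] in r["grants"]),
]

def allowed(request: dict) -> bool:
    """Grant access only if all defense-in-depth layers pass."""
    return all(check(request) for _name, check in LAYERS)
```

The design point is structural: because the checks are independent, removing the assumption that "anyone inside is supposed to be there" only requires adding inner layers, not rebuilding the perimeter.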
Yeah, kind of the principle of least privilege, and things like having network segregation and internal access control lists to ensure that your least-privileged people can't access highly sensitive data. So these are all controls that go into that defense in depth. And this is the mentality that's shifting, because we've stopped talking about if when it comes to cyber security, and we've started talking about when. And I don't think, well, let's say I would like to believe people are talking about when and not if. And if you are still talking about if you have a breach, then that's a problematic mentality that will lead to this kind of focusing on the perimeter and on a defense of keeping people out, instead of having accurate and adequate responses. Good, good. I thought when
I said can we have a secure system, you would only say if we switch it off. Hi, it's Mark again. The DevOps Conference Global will be live in London and streamed across the world on March 14th, 2024. Don't miss it. See you there. Pretty much. I mean, that's the old joke: if you want a secure system, switch it off, pull the network cable and lock it in a cabinet. You can also see, I believe it's called nocode, by our old friend Kelsey Hightower, if you would like to. He had this line: when you have no code, it is absolutely secure. I see. I'm not familiar with that. How does that work? It's a satire on software development that also gets into "the only secure system is no system", if I remember correctly. So I think if you don't code it, it can't be hacked. I like it. That's absolutely true. All right. So this has turned really interesting: it's not so much about the tools. It's not so much about the
technology. It's not even just about having the policies. It's about continuing to practice with these, encouraging the understanding of what incident responses truly are, and working towards analyzing those. So we get to the point where we have to start fixing things, and you mentioned nuke it and start from scratch. Are there any other fixes that we need to think about, or that someone might not have thought of? I mean, you can always try, but it's always going to be second best compared to restoring from a backup. So many security incidents of many kinds come back to having a robust backup policy, ensuring that you have those backups in case of malware, in case of ransomware, in case of corruption. So obviously, if someone gets into your network and you are unhappy with that, you can block them out. You could remove their backdoors. If there's no persistence on there, you can monitor the server. However, it's difficult to know that you have ever cleared 100% of their traces to prevent them from getting back in. It's like trying to save one of your fingers, and, well, it's kind of a weird position to be in, where you have someone who wants to save a specific server, and they don't understand that they're taking the approach of pets instead of cattle, because this is their server and they want to keep that server, and they don't understand that the server is dying and potentially
spreading whatever killed it to the rest of their network. So it's a difficult mentality to face off against. Of course, we have practiced in IT in general for years testing that you can actually restore from backups. I guess the infrastructure as code aspect is also an interesting thing, making sure that people are able to redeploy systems from backups. Oh, definitely. That's part of simply being able to get all this back as fast as possible. When we do these disaster recovery exercises, backup testing, we're looking at two metrics: the objective of returning to normal business, basically, and the objective of, let's say, having everything resolved. So we have these two metrics. One is the sooner metric of "okay, things are okay now", and then slightly later the metric of "things are restored to standard business", and it's all about getting the time to these metrics as short as possible. So anything you can do to speed that up is great. And infrastructure as code is one of the most powerful tools you can have: the ability just to redeploy immediately, or I should say, make small changes and then redeploy immediately, because if you redeploy exactly the same vulnerable software that someone just got into, they're going to get into it again. So you redeploy, patch, restore.
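The point just made about ordering can be sketched as a tiny pipeline: the three callables below stand in for real infrastructure-as-code tooling and are hypothetical, but the comment captures why patching the definition must come before redeploying it:

```python
def recover(redeploy, patch, restore_data):
    """Rebuild a compromised service from code and clean backups.
    The patch is applied to the definition before redeploying, because
    redeploying exactly the same vulnerable software just lets the
    attacker straight back in."""
    definition = patch()            # fix the vulnerability in the IaC definition
    service = redeploy(definition)  # rebuild the service from that definition
    return restore_data(service)    # then restore data from known-good backups
```

In practice each step would be a pipeline stage (template change, `apply`/rollout, backup restore); the sketch only pins down the order.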
One thing I'd like to try and challenge, and it's difficult to challenge you in your own garden there, but the term root cause analysis has annoyed me for years, and I found a counter to it, which is contributing factors analysis. We still talk about root cause analysis all the time, and it's even in a lot of specifications and standards, but it always feels like a little bit of the wrong idea to me, because it's never one reason that things fail. There's always, I think, more than one, and that's why I always try to at least talk about contributing factors. But what do you expect from root cause analysis in incident response plans? I actually think you're spot on there: we talk about root cause analysis, but there's also this term that people use, which is blameless post-mortem. Yes, where they want to look at it and decide what happened and why. And if you do root cause analysis, well, we already know that 90% of security incidents are caused by a person who made a mistake. So if you look at root cause analysis, it's the opposite of the blameless post-mortem, because you're quite literally pointing at a person and saying, okay, this incident was your fault, because you are the root cause. And that doesn't actually help us, because it doesn't matter that we know 90% of incidents are caused by people.
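The contrast being drawn here can be made concrete: instead of a single root-cause field naming a person, a blameless write-up records several contributing factors and the safeguards that were missing. A hedged sketch with invented field names:

```python
from dataclasses import dataclass, field

@dataclass
class BlamelessPostMortem:
    """Deliberately has no 'root cause' and no 'person at fault' field;
    it records what combined to let the incident happen and which
    guardrails to add so the same mistake is harder to make next time."""
    incident: str
    contributing_factors: list = field(default_factory=list)
    missing_safeguards: list = field(default_factory=list)

    def add(self, factor: str, safeguard: str):
        """Pair each contributing factor with the safeguard it implies."""
        self.contributing_factors.append(factor)
        self.missing_safeguards.append(safeguard)
```

Pairing each factor with a safeguard keeps the write-up forward-looking: the output is a list of fixes, not a name.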
We know this, this is a known factor, and it's not going to change. I wish it could, but it's not going to, because people are people and people make mistakes. And that's why so much of this incident response is kind of around processes and people rather than tools and technologies, because it has to be about understanding people. So yeah, this contributing factors idea is actually where we should be going with incident response, but it's still called root cause analysis. Even though, if we look at it as a root cause and we think about a tree, it actually makes sense, because there isn't a single root for the tree; there is this kind of expanding network underground of different tendrils reaching into different places. And if we examine that as the root cause, and then this causes the tree, then it kind of makes sense. But I'm sure that's not how people were thinking about it. So 100%, contributing factors analysis is where you need to go, because even if a person makes a mistake, it means there wasn't a safeguard there for them, there wasn't a guardrail for them, there was nothing to help them, maybe not even awareness training. So yes, contributing factors analysis is definitely the way to go, and if someone like me says root cause analysis, challenge me and tell me I'm wrong. You brought this back to me really, really well, because
I have thought about these contributing factors for a long time, but the way that you placed blameless post-mortem next to it actually gave it a new light, because yeah, an awful lot of the time it does come down to a person, and if we can make the event blameless, then it helps the psychological safety of everybody involved to understand that we're all going to make mistakes. I mean, in software, even the best people make mistakes and write bugs every day. Yep, and that idea of blamelessness is actually something which should be taken out of the final phase and extended across the whole incident, because if you start by telling someone that they have made a mistake, the first thing they are going to do is clam up and try to cover their mistake. It's human nature that they're just going to say, oh well, I'm sorry, I didn't mean to do this, and then shut down, and you don't get the information. So blameless post-mortem, yes, but the whole process should be blameless across the board, because people make mistakes, people will continue to make mistakes, and the more responsibility people get, the more access they have, the larger their mistakes are going to be. Hopefully we can hope that they will be less frequent, but they will still occur. All right then, so let's try this again: what is incident response?
Incident response is exactly what it says it is: it's how to respond to security incidents, and not security notifications. So when there is a specific occurrence of an attack, or a vulnerability is exploited, that is an incident, and it's how we respond to it, how we fix it. Excellent. When should we think about our incident response plans? Yesterday. It's 2024; if you don't have an incident response plan in place at this point, what have you been doing for 10 years? Excellent. So what are the trends in incident response? There's a movement from preventing to responding, so it's kind of switching around to: we know someone will eventually get in; we want to mitigate the amount of damage that is done. Excellent. Two more questions. The four phases of incident response are? Well, they're going to be discovery, where we find out something is happening; analysis, where we find out what it is; the action phase, where we fix it; and then the post-mortem phase. The blameless post-mortem. Yes, blameless, we have to make sure it's blameless. Excellent. All right,
can we have a secure system? Not perfectly, no, but we can do our best, and that's all we're ever doing in security: what we can, now. All right, cool. You know, I always learn from you, and it's always nice to have a conversation about these things, and I think it's neat how we're able to look at different parts of it in different ways and still be able to understand each other. Thank you a lot for today. Thank you. I hope someday we'll get to come here and I'll get to grill you on one of your specialist subjects, because it's starting to feel a bit one-sided. Let's do something like that next time. All right, we'll talk about jazz. Okay, thank you everybody, thank you Darren, thank you all, and we'll see you at the DevOps Sauna next time. Goodbye. We'll now tell you a little bit about who we are. Hi, I'm Marc Dillon, lead consultant at Eficode in the advisory and coaching team, and I specialize in enterprise transformations. Hey, I'm Darren Richardson, security architect at Eficode, and I work to ensure the security of the services Eficode manages and offers. If you like what you hear, please like, rate and subscribe on your favorite podcast platform. It means the world to us.