Never Release on a Friday - The CrowdStrike Outage | The Better Business Analyst Podcast

00:00

The Better Business Analysis Institute presents the Better Business Analysis Podcast with Kingsman Walsh. Hi everybody, and welcome back to the Better Business Analysis Podcast. Today's topic is "don't release on a Friday." We're going to be talking about the CrowdStrike problem that caused wide outages across the world to critical infrastructure running on Windows systems.

00:31

So I'm going to explain what happened and what the cause was, what the resolution was, and some thoughts on what we might be able to learn from this experience going forward. In effect, the recent CrowdStrike issue involved a faulty update to their Falcon sensor. The Falcon sensor is a cloud-controlled sensor which is installed on Windows devices but managed through the cloud, and this is what caused the widespread disruption on Windows systems.

01:11

The issue was identified as a defect in a single content update for this Falcon sensor. This is CrowdStrike's endpoint detection and response software, for which the acronym is EDR, and in simple terms it focuses on modern-age virus attacks, OK? It's actually quite a good piece of software, which focuses on looking for these types of

01:49

attacks. This is not like your general run-of-the-mill antivirus software, just to be clear. So it has to have quite significant access to the underlying running of your Windows system. The software itself looks in there for viruses, effectively, and notifies you. But in order to do that, it needs access deep within the Windows system in order to do

02:16

its job properly. Now, that opens up a massive risk: if the software itself is faulty, then you're going to have a problem. So the impact of that defect in the update for that piece of software was that it caused Windows computers to crash, displaying what we call the blue screen of death, and it prevented them from rebooting properly. OK, so when your computer fails,

02:49

this used to happen. We used to talk about it often with Windows. If you end up with this blue screen, it basically means Windows is saying, "Something's wrong, I'm broken, I can't continue," right? And yes, you've got your own laptop or your own computer, but remember that a cloud computer is exactly the same, right? So these could be either a physical computer or a cloud computer.

03:12

And a lot of the software, applications and infrastructure that the world runs on is running on Windows servers. So this affected them as well, OK? Or mainly them, because it was rolled out automatically via the cloud. And what that meant was this blue screen: they failed, if you like; the computer just dropped into this blue screen of death. And then when it tried to reboot, it actually ended up in a loop. It ended up in a loop of

03:44

booting. It was just stuck there. And of course, with a blue screen of death, it never got to Windows again. So you couldn't use your mouse and keyboard, and it just rebooted itself again and again. Without really understanding low-level IT, you couldn't stop it and go into safe mode, which doesn't load all of Windows; it just loads the critical components. You'd have to go into the BIOS and know how to do that, which people don't have to do anymore, right?

04:12

You know, only if you're an IT professional would you need to do that. So to fix it straight away, you actually had to have some knowledge of IT. This is why it took so long for there to be a resolution.

04:25

So straight away CrowdStrike told people how to fix it. But I looked at the instructions at the time, and even though they're easy to read, they would not be simple for a standard person to follow without IT knowledge. Even first-line support, some of the IT people in your company, wouldn't know what to do. It would involve them going to someone who had some IT

04:48

administration knowledge. And remember, of course, there's only a limited number of people who do that around the world. And sometimes that's outsourced to smaller companies, sorry, to bigger companies who do that. So you have no control over your infrastructure. You had to wait, effectively. Now, in the meantime, they identified what the problem was and rolled out some fixes.

05:11

Both CrowdStrike and Microsoft identified what the problem was. And so Microsoft were working on code to make sure that this update wouldn't affect their code, writing some kind of catching statements in there to stop Windows from being so affected by it, and CrowdStrike also had to roll out an update. And they had to get all that stuff updated, fixed, tested and then rolled out, of course, to all these servers across the

05:36

world. And then once the servers were up and running, people were checking things like, "Were we hacked?" Banks were worried about whether this was used as an opportunity. It would have been a perfect time to attack some of these organizations, which is ironic, because the software is there to protect them from that stuff. If you're technical: the defect was traced specifically to a file within the update which caused the crash.

06:00

And the file was identified as a file called C-00000291*.sys. So there was a specific file that was corrupted. OK, so it wasn't anything major. And this is what happens, right? A small change or a small defect in code can affect things. And the resolution was that the engineering team isolated the issue and put in a fix: another file to replace that file. That's all, really. And for immediate relief:

06:32

You had to, like I said, go into safe mode or Windows recovery, navigate to the drivers folder, and delete the problem file from the drivers. And to be clear, drivers are software that interacts with hardware on your computer. You may have heard the phrase "you need to update your drivers." And of course that's where the problem was. So it's quite low-level code, especially these days with low code. And that's why it had such an impact on the Windows machines.
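The manual fix described above amounts to finding the files in the drivers folder that match the problem pattern and deleting them. A minimal sketch of that step follows, assuming a hypothetical helper `remove_problem_files` and a drivers directory supplied by the caller (the exact folder on an affected machine is not given in this episode). It defaults to a dry run so nothing is deleted by accident.

```python
from pathlib import Path

# File pattern named in the episode; the directory is supplied by the caller.
PROBLEM_PATTERN = "C-00000291*.sys"


def remove_problem_files(drivers_dir, pattern=PROBLEM_PATTERN, dry_run=True):
    """Return files under drivers_dir matching pattern; delete them unless dry_run.

    Hypothetical helper for illustration, not CrowdStrike's or Microsoft's tooling.
    """
    matches = sorted(Path(drivers_dir).rglob(pattern))
    if not dry_run:
        for path in matches:
            path.unlink()  # remove the faulty file
    return matches
```

Calling `remove_problem_files(some_dir)` only lists the matches; passing `dry_run=False` deletes them. On a machine stuck in a reboot loop this would still have to be run from safe mode or a recovery environment, which is the part that needed IT knowledge.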

07:01

CrowdStrike confirmed it wasn't a security incident or a cyber attack, which people thought it was, and that they were working closely with affected customers. And the CEO of CrowdStrike has talked about it. He's one of the highest-paid CEOs in the world. I don't think he's going to be around much longer, to be honest. And what I want to get onto really quickly, seeing as this is just an update, is: what can we learn?

07:29

What can we learn from this right now so that we don't ever do this again? Well, in the world of software, if you work in that space, we have a saying: don't release on a Friday, never release on a Friday, or different variations of that. What it means is that there's no one, or only limited support, over the weekend to fix things. If you roll something out on a Friday and something goes wrong, there's just not many people

07:58

around. They've gone out of the office; they're not there to fix it. There's no time to resolve something if something critical happens. And really, that's what's happened in this situation. We'll find out more, but I'm going to make a prediction that this is about really bad software development and release. OK, so there's a couple of things here. One is that CrowdStrike could roll this out and hadn't done appropriate

08:25

testing. At the end of the day, they hadn't done appropriate testing on enough different devices, or on enough devices in the same conditions as those running critical Windows Server updates. They just hadn't done that, because if they had, they would have caught the bug. So that means they hadn't done appropriate testing.
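The two practices mentioned here, not releasing on a Friday and testing on enough machines before a wide rollout, can be written down as a simple release gate. The sketch below is illustrative only: the function name `release_allowed`, the blocked weekdays, and the zero-failure threshold are assumptions made for the example, not anything CrowdStrike or Microsoft is known to use.

```python
from datetime import datetime

# Hypothetical policy values, chosen for illustration only.
BLOCKED_WEEKDAYS = {4, 5, 6}  # Friday, Saturday, Sunday (Monday is 0)
MAX_CANARY_FAILURE_RATE = 0.0  # any failed test machine blocks the rollout


def release_allowed(when, canary_results):
    """Decide whether a wider rollout may proceed.

    when: datetime of the planned release.
    canary_results: list of booleans, one per test machine (True = healthy).
    Returns a (allowed, reason) tuple.
    """
    if when.weekday() in BLOCKED_WEEKDAYS:
        return False, "no releases on Friday or the weekend"
    if not canary_results:
        return False, "no canary machines tested"
    failures = canary_results.count(False)
    if failures / len(canary_results) > MAX_CANARY_FAILURE_RATE:
        return False, f"{failures} of {len(canary_results)} canary machines failed"
    return True, "ok"
```

For example, `release_allowed(datetime(2024, 7, 16), [True, True, False])` is refused because one test machine failed, and any Friday date is refused regardless of the test results. The point of such a gate is that a bad update stops at a handful of test machines while people are still around to fix it.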

08:43

And equally, the level at which CrowdStrike is allowed to operate on the software and hardware of Windows servers seems a bit extreme: that a third party, which is CrowdStrike, nothing to do with Microsoft, just to be clear, their software, their mistake, was able to cause such an impact to Windows Server, which is supposed to also be safe, right?

09:11

And that can open up a whole other can of worms, in that this is just one piece of software running on Windows servers, but there are actually thousands of them. So Microsoft really need to look at what power they are allowing or handing over to these third parties, and their

09:27

software needs to be better. So if they do identify software that's bad, that could have a catastrophic impact on their product, Windows Server, then they need to write code which looks for this and doesn't allow an update in that situation. Or at least, if there is an update and it doesn't work properly, it doesn't cause the

09:50

effect that we've just seen. So this is a hard look at software engineering 101. It's a good opportunity for us to look at our own code and our own practices in software deployment. And it's really ironic, because we talked about Y2K last week with business rules, and how business rules are so important. Well, this is exactly that. This is a business rule, right? This is a business rule in terms of not allowing third parties to release code that affects our

10:18

software. I mean that should be a number one business rule. And so therefore there should have been system rules in place to stop that from happening. And so Microsoft needs to look at their business rules from a business model point of view because they are going to be affected from here. They could lose a huge amount of

10:32

business. People could go to Amazon or go to Google as a result, and people are just going to be really worried about this, and they're going to really have to explain to some of the largest clients in the world why this happened. This is probably the end of CrowdStrike as a company, or at least of trust in it. And I imagine the CEO will have to step down. Anyway, that's my update, a critical update, just around CrowdStrike. I hope you learned what happened and why that affected everything

11:05

so much. And it's pretty amazing in 2024 that, you know, our biggest IT outage ever was caused by someone writing a bad piece of code in one file which affected everything, as opposed to a cyber attack or anything like that. OK, guys. See you later.

Transcript source: Provided by creator in RSS feed
