
(BNS) The Crowdstrike Thing (With Overmind.tech)

Aug 24, 2024 · 23 min

Episode description

Breaking down the Crowdstrike outage with Overmind.tech


Transcript

Welcome to another bonus episode of the Techmeme Ride Home, another portfolio profile episode. This is sort of a hybrid, because we're going to talk to a company that the Ride Home Fund is invested in, but it's also going to be sort of a newsy explainer episode. We're going to talk a lot about that big CrowdStrike outage, because today we're going to talk to Dylan Ratcliffe of Overmind, and what Overmind does is help people prevent things like this from happening.

So Dylan, thanks for coming on. Thank you for having me, Brian. So let's just start there. Tell me what Overmind does, in a 30,000-foot sense, and then we'll drill into what happened with CrowdStrike.

I mean, you were exactly right in saying that we're trying to prevent these sorts of outages, outages where you make a config change that you think is a good idea and it turns out to be a terrible idea, hopefully not one as serious as the CrowdStrike one. We calculate the blast radius, work out the dependencies, and do a risk analysis in advance, so that people can know that pressing this button could potentially ground all the flights in America. That's what happened when they pressed the button.

I love that term, blast radius, and obviously there was a huge blast radius with the CrowdStrike thing, but go into more detail about being able to identify those risks ahead of time, because that's the key here. Once you've hit the button, it's too late, as we've seen.

Yeah, the only way that we can do that is by understanding the blast radius and the context of what we've found. And it's the case in the CrowdStrike outage, and it's the case in every outage that I've studied, whether I've been involved or I've studied it just from stuff that has been put out on the web, like in the CrowdStrike example, that the change you were making was not bad on its own. It was only bad when combined with other latent factors that already existed within the environment.

So when we're doing a risk analysis like this, it's not enough, it's not possible, to look at a change and say that is a bad idea. You have to take the change in context, understand all of the dependencies, understand how they are currently set up. Is this thing actually in use? What uses it? What is that thing doing? That's how you work out whether a change that might be fine in one environment might cause a massive outage in another environment. All of these things are entirely context dependent.

So working that out has previously been done by people with loads and loads of experience who have a model in their head of how all this stuff fits together and what depends on what. We're trying to sort of augment that by building that model dynamically, doing the risk analysis dynamically, and helping out. Context is key, because the changes themselves are never bad on their own. And Overmind works where people are working, so this is mapping your dependencies on AWS, Kubernetes.

This is essentially a layer on top of where you're working that is just this added sort of insurance policy, I suppose. Yeah, and we deliver it to where you're working. So if you're working in GitHub and you're using GitHub Actions to run your Terraform into AWS or into Kubernetes or whatever you're doing, we deliver the blast radius and the risks straight into, say, GitHub as a comment, or into Terraform Cloud, or into wherever it is that you're currently working today.
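To make that kind of integration concrete, here is a rough, generic sketch, not Overmind's actual implementation, of how a CI step could post a risk summary onto a pull request via the GitHub REST API; the repository, pull request number, and summary text are placeholder assumptions.

```python
# Generic sketch: post a risk/blast-radius summary as a PR comment from CI.
# Illustrative only; repo, PR number, and summary are placeholders.
import os
import requests

REPO = "example-org/example-repo"   # placeholder
PR_NUMBER = 42                      # placeholder
TOKEN = os.environ["GITHUB_TOKEN"]  # provided by the CI environment

summary = "Blast radius: 14 resources affected. 2 high risks identified."

resp = requests.post(
    f"https://api.github.com/repos/{REPO}/issues/{PR_NUMBER}/comments",
    headers={
        "Authorization": f"Bearer {TOKEN}",
        "Accept": "application/vnd.github+json",
    },
    json={"body": summary},
    timeout=30,
)
resp.raise_for_status()
```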

All right, so let's talk about what happened with CrowdStrike, and I'm going to caveat, obviously, that this is beyond my ken. So CrowdStrike actually has released, I think, a couple of post-incident reviews at this point. On a high level, for a dumb-dumb like me, but also for the non-dumb-dumbs out there, tell me what happened. What was the fatal error here? So the fatal error that actually caused everything to break was an out-of-bounds read exception.

There was an array with 20 things in it, and it tried to load the 21st thing, and everything fell over. The reason everything fell over: that wouldn't normally be a terribly catastrophic problem, trying to read the 21st thing in a 20-element list, unless you are running in kernel mode, which the CrowdStrike driver needs to do in order to do the work that it does.

So when you're writing software that runs as a driver in kernel mode like CrowdStrike does, you can't make those sorts of mistakes, you can't afford to. There is nothing to catch you. If you try to read something that isn't there, the whole computer needs to restart, because there is no way to recover from it. And unfortunately, in this instance, the situation that caused it to do that action basically happened immediately.

This wasn't like a 1% chance where the stars have to align and then it reads this 21st element in a 20 element list. It basically reads it straight away, which means the computer crashes. It starts back up and it crashes immediately again, which sent these computers into the blue screen of death loop. That was the thing that actually fundamentally caused it, which sounds very simple, but how we got there is probably the more interesting part.
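To make the failure mode concrete, here is a minimal, hypothetical Python sketch of the class of bug being described: a parser that expects 21 input fields but is handed only 20. In Python this surfaces as a catchable IndexError; in a kernel-mode driver there is no such safety net, so the equivalent read takes the whole machine down.

```python
# Hypothetical illustration of the bug class, not CrowdStrike's actual code.
def read_inputs(fields, expected_count=21):
    # Blindly trusts that `fields` has `expected_count` entries.
    return [fields[i] for i in range(expected_count)]

channel_data = [f"field_{i}" for i in range(20)]  # only 20 fields supplied

try:
    read_inputs(channel_data)
except IndexError:
    # A user-space program can catch this; a kernel-mode driver performing
    # the same out-of-bounds read crashes the machine instead.
    print("out-of-bounds read: tried to access the 21st of 20 fields")
```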

Well, right, because, I mean, we assume these are professionals. There's all sorts of automated and manual testing that I'm sure happened. So again, what was the thing that they missed? Well, so my caveat is that I'm getting all this information from these post-incident reviews. I don't have an internal source either. So there is a degree of reading between the lines that needs to be done in order to find out what was missed, because they don't just say, here is the thing that we missed.

What they do say, and it's kind of interesting, in the first preliminary post-incident review, they start off by talking about how the testing works for the Falcon sensor. So the Falcon sensor is the thing that you actually install. It's the thing that goes and inspects network traffic for suspicious activity and stuff like that. And they go into quite a lot of detail about how that gets deployed. They do automated testing.

They do manual testing. They roll it out internally first. Then they roll it out to early adopters. Then it becomes generally available. And when it is generally available, users can select which sections of their infrastructure get the upgrade first. So you could upgrade the less important stuff first. That's a pretty normal deployment process, to be honest, and it's pretty well explained. Here's the diagram.
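As a rough illustration of that kind of staged rollout, here is a minimal sketch; the ring names and the health check are hypothetical, not CrowdStrike's actual pipeline.

```python
# Minimal sketch of ring-based rollout, with hypothetical ring names.
ROLLOUT_RINGS = ["internal", "early_adopters", "general_availability"]

def next_ring(current_ring, current_ring_healthy):
    """Advance the rollout one ring at a time, halting on any sign of trouble."""
    if not current_ring_healthy:
        return None  # halt the rollout and investigate
    index = ROLLOUT_RINGS.index(current_ring)
    if index + 1 < len(ROLLOUT_RINGS):
        return ROLLOUT_RINGS[index + 1]
    return None  # fully rolled out

# e.g. next_ring("internal", True) -> "early_adopters"
```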

Yes. If you're watching the video on YouTube, this is a diagram that, Dylan, did you take this from the after-incident report or did you draw this up yourself? I drew it up myself after trying to wrap my brain around what they were trying to explain to me. It took quite a lot of effort. I would recommend having a look at them, because it helps to explain the jargon. But what stands out with that process is, well, you've got automated testing, you've got manual testing, so how did it happen then?

Like, how did it get to the point where it broke all of this stuff? And so in the next step of the post-incident review, they end up talking about what are called rapid response content updates, which are a separate type of update, which follows a separate process. And that update process is way more complicated. I'm not going to go into depth; you can read the blog post if you're interested in how they explain it in depth.

Basically, they go on to explain the architecture of how these rapid response updates get delivered to you, but not how they decide to push something out. There's a server that delivers it and all that sort of stuff. And they explain that in detail. But they don't explain how they get the confidence to press the button to send the rapid response update to people.

Whereas in the previous example, they did explain, we do this testing, we do this testing, but when it comes to rapid response content, we get a lot of detail about the architecture and essentially no detail about the process. How do they get the confidence to press the button?

And I think you just have to read between the lines there to work out why. Certainly, if I was writing it and there was a huge amount of testing being done, I would have mentioned it in that situation. But I don't think we're ever going to get confirmation that there wasn't any testing, probably for liability reasons and things like that. But you suggest that there had to be some level of awareness that there could be a problem with this.

Is it one of those where it's like, well, we think it's possible, but it's probably not going to happen, so let's just go ahead? Like, what do you think the thought process was? It reminds me of, you know, the part in Oppenheimer where it's like, there's a small percent chance that all the oxygen on the planet could burn up, but yeah, we're going ahead anyway, push the button. I don't think that there was that.

I don't think that there was, especially because with this particular update, they're a bit cagey about what it was, and fair enough, like it might have been a really important update to address a zero-day that was happening right then. And so maybe there was a lot of pressure to get it out. They don't say that, but they probably wouldn't say that. So maybe there was a lot of pressure. It reads to me as if there wasn't any semblance of risk. It seemed perfectly normal.

One thing that's really interesting is they speak about the timeline of what happened. And the important events in the timeline are they did a whole bunch of testing on this new type of, it's called a template instance. Basically, it's a new way of discovering suspicious activity. They did a whole bunch of testing back in March when that was first released. And then they did three more deployments in April.

And then, yeah, if we can get the timeline view up, which is down towards the bottom of the blog, the first blog, sorry. They do three more deployments of this particular way of looking for suspicious activity. And then it comes to July the 19th, which is the fateful day where they make the decision.

Now, they actually talk about, in the post-incident review, what gave them the confidence to press the button, which is pretty rare, that you would speak about how you felt, like emotionally, why did you think that this was a good idea, in a post-incident review. And I think that they should be celebrated for putting that in. It gives a lot of color, which I think is really interesting. And to answer the question of, like, did they think there was a 1% chance? I don't think so.

The things that gave them confidence, as quoted in the review, were the fact that they did a bunch of testing in March, the fact that they had already deployed similar configuration using the same feature three times before in April, and the fact that there was supposed to be a validator that would catch anything like that. Now, in hindsight, the fact that something was tested a little bit over four months ago is probably not going to give me confidence that it's going to work in production.

Similarly, the fact that I deployed three similar, but not the same, pieces of configuration over the previous month is also not going to give me confidence that something is going to work, to be perfectly honest. If I'm deploying configuration, I want to know that that exact same thing has been deployed somewhere else and tested somewhere else, not just a similar thing or a thing that uses the same feature set, which is what happened here. So I think, in hindsight, it doesn't make sense.
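As a concrete illustration of that distinction, here is a minimal, hypothetical sketch of the kind of gate being described: only ship an artifact whose exact bytes have already passed testing somewhere else, rather than trusting that similar configuration worked before. The tested_hashes registry is an assumption for the example.

```python
# Hypothetical sketch: gate deployment on the exact artifact having been tested,
# not on "similar" configuration having worked before.
import hashlib

def is_safe_to_deploy(artifact_bytes, tested_hashes):
    """Return True only if this exact artifact (by content hash) passed testing."""
    digest = hashlib.sha256(artifact_bytes).hexdigest()
    return digest in tested_hashes

# e.g. is_safe_to_deploy(open("channel_file.bin", "rb").read(), known_good_hashes)
```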

But at the time, I believe that this was normal. Certainly, by the way it's written, they're not saying that people went outside the process. They're not saying that people did anything that they shouldn't have. It seems like doing it this way, testing it once when it's first released and then getting confidence by just keeping on using it, was absolutely the norm. And it must have been working for a long time, since otherwise this would have happened earlier.

In 2023, just 10 vulnerabilities accounted for over half of the incidents responded to by Arctic Wolf incident response. Wouldn't you like to know how to take those off the table and make life more difficult for cyber criminals?

That's just one of the essential insights you'll find inside the Arctic Wolf Labs 2024 Threats Report, authored by their elite team of security researchers, data scientists, and security development engineers, and backed by the data gained from trillions of weekly observations within thousands of unique environments. This report offers expert analysis into attack types, root causes, top vulnerabilities, TTPs, and more.

Discover the attack vector behind nearly half of all successful cyber crimes, why ransom demands climbed 20% from 2023, and find out why 2024 will be an especially volatile year for cybersecurity. Learn more and get your copy now at ArcticWolf.com slash Techmeme. Getting started building a nest egg is something you can easily talk yourself out of doing by saying, I don't have enough money to invest yet. Well, that's exactly the point. A, you don't need a lot to get started.

And B, you'll never have enough if you never start. Acorns makes it easy to start automatically saving, and you don't need a lot of money or expertise to invest with Acorns. In fact, you can get started with just your spare change. Acorns recommends an expert-built portfolio that fits you and your money goals, then automatically invests your money for you. I like the fact that Acorns is like banking with built-in investing.

They'll invest a portion of your money for you automatically, even automatically set aside money for emergencies. You can even invest for shorter-term goals like a car or a home. Again, you can get started with just your spare change. Head to acorns.com slash ride or download the Acorns app to start saving and investing for your future today. Again, acorns.com slash ride or download the Acorns app.

Paid non-client endorsement. Compensation provides incentive to positively promote Acorns. Investing involves risk. Acorns Advisers LLC, an SEC-registered investment advisor. View important disclosures at acorns.com slash ride.

One of the things in your piece that you maybe suggest is the idea that people assume changing configuration is safer than changing code, and in reality, as you point out, there have been a lot of outages, large outages, recently where the root cause was a configuration change. Is this sort of suggesting that people need to realign their thinking in terms of the risk profile of doing config? I think so.

I think that a lot of people understand the risk of doing configuration changes, not everybody, and I think that almost all massive outages like this end up being configuration changes. I think specifically in the case of CrowdStrike, I doubt that it was seen as configuration. It was this proprietary binary file, and then there's all these other layers of proprietary stuff that's interpreting it and things like that.

When you zoom out and look at it, it is a sort of modular thing that gets installed, which is the sensor, that takes configuration that teaches it what suspicious stuff to be looking for. And so even though it's hidden under many, many layers of proprietary in-house stuff, it effectively is a config file. But I don't think it was being treated that way.

I think that not many people would deploy a config file to production without testing it, but because it was wrapped in so many layers of application-specific stuff, it was sort of seen as not really a config file, not really something that could possibly cause a breakage.

So I think that a lot of people understand the potential huge impact of configuration, but I think in this particular instance, it was somewhat covered up by lots and lots of layers of application-specific stuff. You certainly wouldn't go and change a config file directly in production, but that's kind of essentially what was happening here if you really strip it back. Real quick, and this is purely opinion, but what do you think of their response?

Because there's been back and forth between certain customers like Delta saying, you didn't help us enough, and them saying, well, we did reach out to help you. Or just in a broad sense, how do you think that they responded to this? Given that, again, this is one of the biggest outages in history, so I don't know how you can get an A+ for any of this. Yeah, I mean, the Delta stuff has been hilarious to watch. The back and forth is very entertaining. The Reddit comments as well were good.

I think overall, it wasn't too bad. You really did have to read between the lines to see what actually happened, which it would have been nice to see less of. It would have been nice to not have to work quite so hard. The amount of jargon as well; I wrote this blog post mostly for my own understanding, because there's so much jargon in there. I understand that they kind of had to use it, but I think it could have been simplified more.

Then, when they released the full root cause analysis, the first two paragraphs being marketing talk about the powerful on-sensor AI and how each sensor correlates context from its local graph store and stuff like that, I think that was a bit of a slap in the face. Given what had happened, I didn't need to read two paragraphs of marketing stuff at the beginning of that root cause analysis.

I think that was in very poor taste, but to be honest, the mitigations that they're putting in place are reasonable. Once again, the mitigations sort of confirm that my reading between the lines was correct in that they are implementing testing.

They don't say it's testing for the first time, but they are implementing testing in this particular workflow, which is certainly going to help with something like this, where you have a bug that is so completely catastrophic that it breaks every single thing that it touches immediately. Any amount of testing will catch that. Will it catch bugs that are like a 1% chance thing? Maybe, maybe not.

That's where you need to be doing staged deployments, and it does say that they are going to give customers the ability to control it. Frankly, I don't think that any security vendor will be able to go to their customers anymore and say, hey, we just push updates out to you and you don't have control over it. I think that while I'm sure a lot of security vendors do the same thing, those days are done.

If you're having a renewal conversation, I think the heart of that conversation will be how do we control the updates that get pushed out, because anything else is not reasonable. I think that what they've done will definitely stop something this big, because just running it on your local laptop probably would have caught something this likely to occur.

But hopefully giving customers more control over the way it's rolled out will help for the 1% of things that only affect certain configurations, which might be really detrimental to a single customer because everything they have fits into that 1%, but are not going to take down everything in quite such a spectacular manner.

So let's bring this back to Overmind, and let's imagine that someone listening out there, maybe they're not working at CrowdStrike, but they're going to push out something similar with their product. In a little bit more detail, can you describe to me tangible ways that Overmind would help prevent a CrowdStrike disaster like this for the listener? In fairness to my own customers, not many people skip deploying to a test environment before they deploy to production.

So even without Overmind, they have at least a somewhat representative example that does help you to get some degree of confidence. The problem is that deploying things into production is never quite exactly the same as testing. There are always more dependencies, things are larger. There are things that have existed for a long time, that people have forgotten what they do, that are usually not documented, and those dependencies are not well understood.

And so what we really specialize in, especially when you're going to production, is being able to see what the dependencies are in that specific environment, captured in real time. So we go out, we find what your AWS looks like, we find the dependencies in real time, as they look right now.

So for example, and this is something that happened to one of our customers somewhat recently, they had used a security group in AWS for something, and they managed it with Terraform and did everything by the book, but they'd given it a name like "internet access" or some really generic name. And they were cleaning up after this project and deleting the security group, and they deleted it in all of the other environments and it was fine.

And when they deleted it in production, a huge amount of their fleet just stopped working. And it was because other teams were not using Terraform, they were doing things manually. And because it had such an incredibly generic name, people had just selected that security group.

And so it meant that by changing the rules on that security group, they were changing the rules for a huge amount of internal stuff, because everyone had just been using it, because they thought they were supposed to, because it had such a good name. And so you have to look at things in production, not base it on how it worked in test, because in test it didn't really matter. People weren't doing as much stuff manually.

These other teams were not creating this dependency in test, but in production they were. And so it meant that without actually doing the risk analysis again, doing the blast radius calculation again in prod, it wouldn't have been possible to catch something like that. And in the case of things like CrowdStrike, it probably would have been possible to catch it with testing.
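As a rough, hand-rolled illustration of that kind of production blast-radius check (not how Overmind actually implements it), here is a sketch that asks AWS which network interfaces still reference a security group before anyone deletes it; the security group ID is a placeholder.

```python
# Rough sketch: check the production blast radius of a security group before
# deleting it, by listing the network interfaces that still reference it.
# Not Overmind's implementation; the security group ID is a placeholder.
import boto3

ec2 = boto3.client("ec2")
sg_id = "sg-0123456789abcdef0"  # placeholder ID

response = ec2.describe_network_interfaces(
    Filters=[{"Name": "group-id", "Values": [sg_id]}]
)
attached = [eni["NetworkInterfaceId"] for eni in response["NetworkInterfaces"]]

if attached:
    print(f"{sg_id} is still attached to {len(attached)} network interfaces; "
          "deleting it would break them")
else:
    print(f"No network interfaces reference {sg_id}")
```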

But the more common outages are not things that are caught in testing, because mostly people are already doing that; they happen because of discrepancies between testing and prod, discrepancies between dependencies, and things like that. And that's sort of what we specialize in. I don't think I've mentioned yet that you can find out more at overmind.tech. That's O-V-E-R-M-I-N-D dot T-E-C-H.

I'm going to link to the blog post that we've been discussing here, but also, if anyone listening is interested in finding out more about Overmind, how should they get in touch? What do you want people to know about what you all are doing right now? The easiest way to find out more about Overmind is to just install our CLI and run overmind terraform plan. It's just like a normal Terraform plan, but you get a blast radius and you get risks.

That's certainly the easiest way to get started. If you want to speak to me about it, hit me up on the website. There's a contact form; it goes straight to me. I'd love to speak to you about it as well, but certainly the easiest way to get started is to install the CLI and run overmind terraform plan. Beautiful. Again, that's overmind.tech. Dylan, thanks for giving us an explainer and giving us possible solutions so that this doesn't happen to you. Thank you very much, Brian.
