The Day CrowdStrike Broke The World

A look at what happened from both the technical and the business side of things.
July 20, 2024, by Jonathan Hall, Synephore

Unless you've been living under a rock - or you have the luxury of a secluded island, far from the digital dependence the rest of the world has developed - you've surely heard about the global IT outage on Friday that abruptly threw airports, hospitals, shops, large corporate chains, emergency call centers and even your own kids back to the Stone Age.

For me personally, the first hint that something was very wrong was waking up to my wife complaining about how long the Windows update was taking to install. That agitation turned into sudden panic when she was greeted with the infamous Blue Screen of Death in a repeating boot loop.

This sent her running out the door and heading to her office as though they were giving away free money. It wasn't long before she called me in shock, describing a scene of chaos in the office: every computer screen she walked past on the way to her desk viciously taunting her, all adorned with ASCII frowning faces and an unhelpful message about having run into some type of problem.

At that point I had not seen the news yet, but when I opened up my browser I was immediately greeted with a sea of posts on every media outlet declaring that a tech apocalypse had fallen upon us. My first thought?

Dammit, Microsoft - what have you done?

Outrage over Microsoft update...

So, a Microsoft update broke everything. At least that's what the initial media frenzy claimed. Who could blame them? Reports of Microsoft 365 outages were among the first topics to hit the global media.

However, Microsoft was quick to deflect the blame and clear the air about what happened, informing the world that it was not a Microsoft update that turned back the sands of time on our civilization's progress, but an update pushed out by a company called CrowdStrike.

Wait, who is CrowdStrike?

CrowdStrike is a very large cybersecurity firm that the majority of people, including many of us in tech, had most likely never even heard of before now, despite how deeply embedded it is in major companies' Windows infrastructures. Ironic, isn't it?

Not nearly as ironic as the fact that George Kurtz, the CEO of CrowdStrike, has been in a similar boat before, back in 2010 when he was CTO of McAfee. Yes, the same man who is CEO of CrowdStrike today was CTO of McAfee when a bad definition update to their product caused a similar worldwide outage, knocking huge numbers of Windows machines offline.

The product in question this time - though again under his watch - is called CrowdStrike Falcon, a digital threat detection platform used by, well... apparently quite a lot of organizations. I think it's safe to say it's used by just about everyone that actually matters to us, as evidenced by the world grinding to a halt when it went haywire.

If you're familiar with older Intrusion Detection Systems (IDS) such as Snort, it's a somewhat similar concept: threat detection based on pattern identification. Only this product is focused on Windows machines, consisting of a local agent that runs on individual hosts and lives in kernel space. Even more interesting, it's blessed with the intelligence of AI. Before we go insisting AI will replace the world, let's keep in mind it clearly didn't prevent things from going boom here.

In the simplest of terms, CrowdStrike Falcon is an Intrusion Detection and Prevention System that uses pattern definitions to identify potential threats and bad actors on systems so it can alert system administrators to suspicious activity. It's meant to save the technological world, not help the bad guys take it down.
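
To make that concrete, here is a deliberately tiny sketch of what signature-based detection boils down to - written in Python, and nothing like Falcon's actual kernel-mode implementation: compare observed activity against a set of known-bad patterns and flag whatever matches. The signature names and command lines below are made up purely for illustration.

    import re

    # A hypothetical "definition file": patterns describing known-bad behaviour.
    # Real products ship thousands of these and update them constantly.
    SIGNATURES = {
        "credential-dumping": re.compile(r"sekurlsa::logonpasswords", re.IGNORECASE),
        "encoded-powershell": re.compile(r"powershell.+-enc\S*\s+[A-Za-z0-9+/=]{40,}", re.IGNORECASE),
        "shadow-copy-wipe":   re.compile(r"vssadmin\s+delete\s+shadows", re.IGNORECASE),
    }

    def scan_event(command_line: str) -> list[str]:
        """Return the names of any signatures the observed command line matches."""
        return [name for name, pattern in SIGNATURES.items() if pattern.search(command_line)]

    if __name__ == "__main__":
        events = [
            r"notepad.exe C:\Users\alice\notes.txt",
            "vssadmin delete shadows /all /quiet",
        ]
        for event in events:
            hits = scan_event(event)
            if hits:
                print(f"ALERT: {event!r} matched {hits}")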

So how did it go rogue and break everything?

In order to keep up with the rate at which security threats are evolving, vendors of IDS, IPS and anti-malware products are pushing out definition updates more frequently than ever before. CrowdStrike is no exception, and it routinely does exactly that without incident. They're usually pretty good about it, given we hadn't had a global meltdown before now.

In this instance, an update to those definitions was pushed out by CrowdStrike on Thursday, July 18, and it unfortunately caused a problem when loaded into their kernel component. Given that the component lives in kernel space - the core of any operating system - the bug crashed the kernel and left Windows unable to boot. But hey, the host can't get looted if it can't get booted, so it doesn't get much more secure than being bricked! Too soon?
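
CrowdStrike's driver is closed source, so purely as an illustration of the class of bug - and absolutely not their actual code or file format - here is a toy parser that blindly trusts a count field in the definition file it's handed. In user-space Python the mistake surfaces as an exception; the equivalent mistake in a kernel-mode driver is a read past the end of a buffer, and that means a bugcheck: the Blue Screen of Death.

    import struct

    def parse_definitions(blob: bytes) -> list[tuple]:
        """Toy parser that blindly trusts the record count in the file header.

        Hypothetical layout: 4-byte little-endian record count, then 16-byte records.
        """
        (count,) = struct.unpack_from("<I", blob, 0)
        records = []
        for i in range(count):
            # No check that the buffer really contains `count` records. Python's
            # struct module raises an error here; the same mistake in a kernel
            # driver reads past the end of the buffer and takes the machine down.
            records.append(struct.unpack_from("<16s", blob, 4 + i * 16))
        return records

    # A malformed update claiming 1,000 records but shipping only one.
    bad_blob = struct.pack("<I", 1000) + b"\x00" * 16
    parse_definitions(bad_blob)  # raises struct.error after the first record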

Naturally I'm trying to add some humour to things, because the truth is that recovering from this is anything but funny. In almost all cases it requires manually putting your hands on the machine - typically booting into Safe Mode or the recovery environment and removing the offending CrowdStrike file - unless of course you're willing to lose all of your data and just network boot a clean install. For many it is still a tedious, ongoing process, with little if any sleep for the operations teams. Of course, nuke and pave is exactly what some companies did, simply because it was easier and faster to recover that way.

Good thing you bought that cloud backup service, right?

So, what did we learn from the technical side of things?

As an engineer, I've always been at odds with folks from the security side. Their responsibility to identify threats and protect organizations and infrastructures from a world of evil no-good-doers really does not come easy. In many cases they're amazing at very low-level technology, but the challenge I often find is that they can miss the larger impact of the solutions they're proposing while on their thankless quest of fortifying the infrastructure against fifteen-year-old kids in Guy Fawkes masks as well as state-sponsored hackers.

Most often it's up to the systems administrators to push back on security teams when proposals risk the stability and usability of the operating system. Most often they'll lose the battle, because in almost all companies the security teams have a license to override anyone and everyone and simply materialize unquestionable policy out of thin air. This, however, is most certainly one of those cases where I believe more pushback from corporate IT staff was required: a single point of failure that could take down every Windows system in their infrastructure if it went awry. Even worse, a single target for hackers to breach in order to bring the world to its knees. Really?

So how else are we supposed to identify if and when the bad guys get in?

Remember earlier when I mentioned Snort? It's far from perfect, and I'm not saying it would be a suitable replacement by any means - but there's one major difference between it and CrowdStrike's product: it's not installed on every single machine, and it's not in the critical path of the OS booting up. It uses network sniffing to monitor for traffic that matches patterns of malicious activity, and when your Snort box goes down, it doesn't take every other system down with it. It also shows that you don't need to be on-host to identify malicious traffic patterns on your network.
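
As a rough sketch of that off-host approach (toy code, nowhere near Snort's actual rule engine), a passive sniffer on a mirror port can flag suspicious traffic without installing anything on the endpoints. This example assumes the scapy library, an interface named eth0 and made-up byte patterns - all placeholders to adjust for your own network.

    from scapy.all import Raw, TCP, sniff  # passive sniffing; needs root/admin

    # Made-up byte patterns standing in for real detection rules.
    BAD_PATTERNS = {
        "smb-exploit-marker": b"\x00SMB\xde\xad",
        "cred-dump-command":  b"sekurlsa::logonpasswords",
    }

    def inspect(packet) -> None:
        """Flag any packet whose payload contains a known-bad pattern."""
        if not packet.haslayer(Raw):
            return
        payload = bytes(packet[Raw].load)
        for name, pattern in BAD_PATTERNS.items():
            if pattern in payload:
                dport = packet[TCP].dport if packet.haslayer(TCP) else "?"
                print(f"ALERT [{name}] {packet.summary()} -> port {dport}")

    if __name__ == "__main__":
        # Watch a mirror/span port: the monitored hosts run nothing extra, and
        # if this box falls over you lose visibility, not your entire fleet.
        sniff(iface="eth0", filter="tcp", prn=inspect, store=False)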

For other levels of on-system security, there are products out there which do not sit directly in the kernel's path. Kernel-resident agents, on the other hand, are effectively rootkits with a legitimate purpose, and they leave systems in a very compromising position when rules become overly burdensome or a mistake is made in a policy update.

While the desire to lock down and protect systems as much as possible in today's insecure and digitally dependent world is well understood, the growing number of on-host security mechanisms is putting strain on systems across the globe, challenging even the most senior and capable admins and engineers to keep systems stable and usable. Arguing with any level of technical management about the risks an on-host security tool can present rarely results in a successful outcome; if anything, it more often lines you up for the slaughter politically. So it's usually left to the admins to do the best they can to prepare for any potential side effects of those solutions. But clearly, some people dropped the ball on that one.

Just as code changes and deployments require a significant amount of testing, so do the patches you deploy onto your systems. So how is it that a patch which rendered systems unable to even boot ended up being distributed to every machine in all of these impacted organizations?

Don't patches happen automatically for all of our machines?

As a general rule of thumb, most of us do not blindly install patches - not even on our home machines. We wait a few days, if not weeks, to see who screams about what the updates broke before we risk our dearly beloved machines. Yes, it happens that frequently, even from top names like Apple.

Unfortunately, no product is immune to such mistakes - they do happen everywhere these days. However, in large companies with mission-critical systems, we have a plethora of tools at our disposal to prevent exactly this scenario.

In the case of Windows updates in companies and enterprises, Microsoft provides a pretty decent solution for managing your patches called Windows Server Update Services. When used properly, it lets administrators control which patches and updates are deployed to which sets of machines, and when.

For Linux, tools such as Foreman or Red Hat Satellite do the same thing. They leverage snapshots of the software repositories to control rollouts in a phased manner, pointing subsets of hosts (i.e. lab rats) at those snapshots before rolling the content out to the rest of the infrastructure.
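
Conceptually, all of these tools implement the same ring-based idea: a small canary group gets the update first, and the rollout only advances once the previous ring has soaked without incident. Here is a rough, tool-agnostic sketch of that logic in Python - hypothetical host groups and stubbed-out patch and health-check functions, not tied to WSUS or Satellite specifically.

    import time

    # Hypothetical rollout rings: each ring gets the update only after the
    # previous ring has soaked without incident. Host names are placeholders.
    RINGS = [
        ("lab-rats",    ["lab01", "lab02"],          2 * 3600),   # soak 2 hours
        ("canary-prod", ["web01", "db01"],           24 * 3600),  # soak 1 day
        ("everyone",    ["web02", "web03", "db02"],  0),
    ]

    def apply_patch(host: str) -> None:
        """Stub: point `host` at the new repo snapshot / approve the update for it."""
        print(f"  patching {host}")

    def host_is_healthy(host: str) -> bool:
        """Stub: did it reboot cleanly, are services up, any new crash dumps?"""
        return True

    def roll_out(rings=RINGS, sleep=time.sleep) -> None:
        for name, hosts, soak_seconds in rings:
            print(f"Ring '{name}':")
            for host in hosts:
                apply_patch(host)
            sleep(soak_seconds)  # let the lab rats run before touching the next ring
            if not all(host_is_healthy(h) for h in hosts):
                raise RuntimeError(f"Ring '{name}' looks unhealthy - halting rollout")

    if __name__ == "__main__":
        roll_out(sleep=lambda s: None)  # skip the real soak window for a dry run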

Sounds like a no-brainer to build such a thing into the patching protocol of every organization, doesn't it?

Seriously - it's time to get a grip on tech.

We're facing a nearly existential threat at the rate we're going with technology. Our entire lives revolve around it. Whether that's good or bad is outside the scope of this article, but it's an unfortunate truth we need to face.

We are living in a day and age where technical mistakes cost people their lives. A key example is Boeing's MCAS system, which killed more than 300 people before attention shifted towards it, despite several Boeing engineers having previously raised safety concerns about its implementation to management. Some even resigned over it. It wasn't until massive federal probes resulted in the grounding of entire fleets of their newest model - and stock prices subsequently plummeted - that they finally made any adjustments. Even then, these are the same people who wrote it, and it was meant to mitigate an earlier engineering flaw with the engine placement rather than actually fix it in the first place.

We need to start getting a much better grip on technology, especially in organizations which are key to the survival and function of our society. The answer is to introduce better controls around technical infrastructure, but that really comes down to who determines and implements those controls. Policy that hinders progress does not guarantee stability; it just slows progress and pushes the right talent away.

The point is not to make technical changes harder, but there needs to be appropriate oversight of, and accountability for, the policies within organizations' technical divisions.

Not to mince words: the leaders driving your technology infrastructure need to reflect on what just happened and start assessing how they'll prevent it from happening again.

