CyberHoot

July 30, 2024

Understanding the CrowdStrike Global Outage

On Friday, July 19th, CrowdStrike, a leading provider of Endpoint Detection and Response services, caused the single largest computer outage in history. The outage is estimated to have cost Fortune 500 businesses $5.4 billion (Parametrix Report). Truthfully, this type of event has happened many times in the past, though never with such global impact. In 2006, Microsoft caused a BSOD with a patch update. In 2010, a McAfee virus signature update did the same. This blog breaks down what happened, what it means, and what organizations should take from the event. History repeats itself for those who forget, so let’s look at this latest incident to understand it better.

The Incident: What Happened?

On July 19th, 2024, Windows systems running version 7.11 and above of CrowdStrike’s Falcon sensor received a faulty channel file. The file caused the sensor to read a memory location that did not exist, leading to kernel instability and a Blue Screen of Death (BSOD) loop. Channel File 291 caused the issue, and though it was distributed for only a little over an hour, an estimated 8.5 million Windows devices crashed worldwide. The crash persisted across reboots and typically required hands-on access to each machine to fix. The event led to widespread disruptions in manufacturing, air travel, hospitals, and many other industries globally. It led some to ask whether a nation-state attack was in progress. Fortunately, that was not the case; a simple update gone wrong was the root cause, which we’ll detail next.
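
To make the class of bug concrete, here is a minimal Python sketch of reading a field by position from a content record, once without and once with a bounds check. It is purely illustrative and is not CrowdStrike’s code: the function names read_field_unchecked and read_field_checked and the three-field record are assumptions invented for this example. In Python the unchecked read raises an exception that can be caught; in kernel-mode code the equivalent invalid read takes down the whole machine.

```python
from typing import Optional, Sequence


def read_field_unchecked(record: Sequence[str], index: int) -> str:
    # No validation: an index past the end of the record is an invalid access.
    return record[index]


def read_field_checked(record: Sequence[str], index: int) -> Optional[str]:
    # Validate the index against the record length before reading.
    if 0 <= index < len(record):
        return record[index]
    return None  # Reject the content instead of crashing.


if __name__ == "__main__":
    record = ["field-a", "field-b", "field-c"]  # hypothetical 3-field record
    print(read_field_checked(record, 3))  # index 3 does not exist
    try:
        read_field_unchecked(record, 3)
    except IndexError as exc:
        print(f"unchecked read failed: {exc}")
```

The design point is that content arriving from outside should be validated before it is used, so that a malformed file is rejected rather than acted upon.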

A Simple Mistake, Not a Cyberattack

CrowdStrike confirmed the outage was not due to a cyberattack, but rather to a faulty update that had not been properly tested. CrowdStrike’s CEO, George Kurtz, was coincidentally chief technology officer at McAfee during its 2010 BSOD event, but appears not to have learned from that past mistake. In light of this global BSOD, CrowdStrike offered the following changes to its business practices to avoid a similar mistake in the future.

Increased Third-Party Reviews: CrowdStrike will hire more third parties to review its business practices and identify opportunities for process improvements.
Enhanced Update Testing: CEO George Kurtz has promised to redouble the update testing processes (which broke down in this event) to ensure the quality of future updates.
Staged Deployments: CEO Kurtz promised to deploy updates in phases to groups of end-user computers. This should ensure an error of this magnitude is identified early in the deployment schedule and all further updates are stopped. [Editor’s Note: this was the take-away prevention measure in 2005 for Windows patches and 2010 for McAfee. Let’s not take updates immediately, but stage them for a day later to prevent being the BSOD guinea pigs!]
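
The staged deployment idea can be sketched in a few lines of Python. This is a hypothetical illustration, not CrowdStrike’s actual rollout process: the function staged_rollout, the deploy_and_check callback, the host names, and the 1% failure budget are all assumptions made for this example. Hosts are grouped into rings, and the rollout halts as soon as any ring exceeds the failure budget, so later rings never receive the bad update.

```python
from typing import Callable, List, Sequence


def staged_rollout(
    rings: Sequence[Sequence[str]],
    deploy_and_check: Callable[[str], bool],
    max_failure_rate: float = 0.01,
) -> List[str]:
    """Deploy ring by ring; stop once a ring exceeds the failure budget.

    Returns the hosts that were updated successfully.
    """
    updated: List[str] = []
    for ring in rings:
        failures = 0
        for host in ring:
            # deploy_and_check returns True if the host is healthy afterwards.
            if deploy_and_check(host):
                updated.append(host)
            else:
                failures += 1
        if ring and failures / len(ring) > max_failure_rate:
            break  # Halt: later rings never receive the update.
    return updated


if __name__ == "__main__":
    rings = [["canary-1", "canary-2"], ["office-1", "office-2", "office-3"]]
    # Simulate a bad update that breaks every host it touches.
    print(staged_rollout(rings, lambda host: False))
```

With a faulty update, only the two canary hosts are affected before the rollout stops; the second ring is never touched.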

These are all great measures to adopt in the face of such an incident. Will they be enough for CrowdStrike to recover? Much depends on the long-term implications of such a widespread outage. Here are some possible outcomes for CrowdStrike.

Implications of a BSOD Outage on a Vendor

A global outage at a cybersecurity firm like CrowdStrike has several significant implications:

Erosion of Trust: Clients have lost trust in CrowdStrike. An outage like this erodes confidence in CrowdStrike’s internal safeguards and business practices.
Operational Disruptions: Basic business functions were completely down and unrecoverable for an extended period of time.
Security Monitoring Outages: Organizations depend on continuous monitoring to defend against threats. An outage like this disrupts monitoring operations which could allow attacks to go unnoticed.
Increased Vulnerability: During an outage, organizations are more vulnerable to other attacks, as their attention is elsewhere and their usual defenses are offline.

Risk Mitigation Strategies

In light of this incident and the similar past incidents outlined above, organizations must plan for future BSOD events. Here are some practical steps any organization can take to partially mitigate and recover more quickly from these events:

Delay updates: Immediately applying vendor updates, signature updates, or patches can lead to global outages as we’ve seen here. Delaying adoption of such updates by even 12 hours can prevent nightmare scenarios.
Diversify Operating Systems: This incident impacted all Windows computers running CrowdStrike. Consider operating some critical systems on alternate operating systems (Unix, macOS, Linux), or in extreme cases, where uptime is mandatory, perhaps a diversity of security software (though CyberHoot recognizes this is very unusual and unlikely for most businesses).
Perform and Test Backups: Regularly back up data and system configurations and test restores quarterly. In the event of an outage, you’ll be more confident you can quickly restore systems and data and maintain operations.
Business Continuity and Disaster Recovery Plan: During the CrowdStrike BSOD, many businesses were left to exercise their BCDR plans for the first time. Gaps existed in most cases, and many times individuals did not know or understand their role in recovery. Draft a BCDR plan and then practice it annually in a tabletop exercise to identify gaps and ensure responsible parties are practiced in their roles.
Stay Informed: Keep up with news and updates from your security providers. Understanding the latest events, threats, and possible mitigations can help you better prepare for potential issues. In the CrowdStrike event, it was reported that rebooting an affected system as many as 15 times could return it to operational status.
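
The “Delay updates” step above amounts to a simple hold-period check, sketched here in Python. This is a minimal illustration under stated assumptions: the function is_update_eligible and the example timestamp are hypothetical, and it presumes your patching or endpoint tooling lets you control when an update is applied, which not every product does.

```python
from datetime import datetime, timedelta, timezone


def is_update_eligible(
    released_at: datetime,
    now: datetime,
    hold: timedelta = timedelta(hours=12),
) -> bool:
    """Return True once the update has aged past the hold period."""
    return now - released_at >= hold


if __name__ == "__main__":
    # Example release timestamp (an assumption for illustration).
    released = datetime(2024, 7, 19, 4, 0, tzinfo=timezone.utc)
    print(is_update_eligible(released, released + timedelta(hours=1)))
    print(is_update_eligible(released, released + timedelta(hours=12)))
```

An update released one hour ago is held back, while one that has aged 12 hours or more is allowed through. The hold period is a trade-off: a longer delay gives more time for a faulty update to be withdrawn, but also leaves systems without the newest protections for longer.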

Conclusions

The CrowdStrike global outage offers us a potent reminder of what can happen when a mainstream product causes a major outage. Such incidents remind us how important a multi-layered security strategy can be. They also remind us to be better prepared by practicing our Incident Management and BCDR plans ahead of time. Given that this BSOD threat will always exist, we can partially mitigate the risk by slightly delaying updates (12 to 24 hours).

In a world where cyber threats are constantly evolving, staying proactive and alert is the best defense. Understanding this BSOD risk and implementing various protective strategies will help your organization remain secure and resilient in the face of such unexpected challenges.

