CrowdStrike Outage 2024
By: Anoop Sandhu | 7/19/24
“Major global IT outage grounds flights, hits banks and businesses,” “Emergency services numbers offline worldwide,” “Operations delayed as hospitals pause services.”
These are just a few of the headlines the world woke up to today after a CrowdStrike patch gone bad caused a major outage.
Who is CrowdStrike?
CrowdStrike is a leading cloud-based cybersecurity platform that has helped investigate and respond to some of the world’s largest breaches and cybersecurity incidents. With over 29,000 customers, it is widely regarded as a gold standard in cybersecurity across the globe.
What Happened?
The worldwide outage traces back to CrowdStrike’s “Falcon” platform, a cloud-based product that combines antivirus, threat detection, real-time monitoring and more into a single solution. The update installed faulty software that caused Windows machines to experience the dreaded “Blue Screen of Death” and get stuck in a boot loop. Essentially, once the update was installed and you restarted, your machine would not boot up.
Resolution
CrowdStrike is actively working on a permanent fix, but in the immediate hours after the incident the only way to get up and running was to have access to the affected machines so the faulty file could be deleted. IT would need to log into each server, find the faulty file, delete it from the machine, and then reboot the instance. That is a truly daunting task for those who found themselves with hundreds or even thousands of affected devices.
How? Why?
Given the scope of this historic incident, the “hows” and the “whys” will be dissected by organizations and governments around the world looking for answers and, in many cases, for alternatives. This event shows that even an organization doing everything “right” (in this case, using one of the best security vendors in the world) can still have its systems impacted and suffer real consequences. But that does not mean we have to accept that there is nothing we can do.
Mitigation
There are many steps an organization can take to reduce risk and mitigate the impact of events like this.
While you cannot plan for everything, some fundamental best practices hold true that can help.
1. Do not update all your servers at once
A simple strategy to combat bad updates is to perform rolling deployments with time built in between batches for validation. While SilverTech was not affected by CrowdStrike, one of our managed clients was. Once an initial batch of servers went offline and our automated monitoring notified us, we were able to pause updates, which bought us time to remediate the issue. Had all servers been updated on the same schedule, all would have gone down simultaneously and end users would have been affected. The way this event rolled out, no end users saw any downtime or loss of services.
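The rolling approach described above can be sketched in a few lines of Python. This is a minimal illustration, not the tooling SilverTech uses: `rolling_update`, `apply_update`, `is_healthy` and `wait` are hypothetical names, and the two callbacks stand in for whatever patching and monitoring tools you already have.

```python
from typing import Callable, Dict, Iterable, List, Sequence


def batches(items: Sequence[str], size: int) -> Iterable[List[str]]:
    """Split the server list into consecutive batches of at most `size`."""
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


def rolling_update(
    servers: Sequence[str],
    batch_size: int,
    apply_update: Callable[[str], None],  # hypothetical: patches one server
    is_healthy: Callable[[str], bool],    # hypothetical: asks monitoring about one server
    wait: Callable[[], None] = lambda: None,  # validation window between batches
) -> Dict[str, object]:
    """Update servers batch by batch and stop at the first unhealthy batch."""
    updated: List[str] = []
    for batch in batches(servers, batch_size):
        for server in batch:
            apply_update(server)
            updated.append(server)
        wait()  # give monitoring time to notice a problem
        failed = [s for s in batch if not is_healthy(s)]
        if failed:
            # Pause here: the remaining servers keep running the old version.
            return {"updated": updated, "failed": failed, "halted": True}
    return {"updated": updated, "failed": [], "halted": False}


if __name__ == "__main__":
    fleet = ["web1", "web2", "web3", "web4"]
    broken = {"web2"}  # pretend the update breaks this server
    result = rolling_update(fleet, 2, lambda s: None, lambda s: s not in broken)
    print(result)
```

In a real setup, `wait` would be a pause long enough for your monitoring to report on the batch, and a halted run would trigger an alert rather than just a return value.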
2. Do not test only on production
This is a time-tested strategy that still gets away from people, especially with patch management. Always maintain a non-production environment that mimics your live setup and receives patches first. The time and effort needed to set up automated monitoring is generally trivial, and catching even a single incident pays dividends.
3. Have a plan B
It is not just software that can fail; a catastrophic event can affect regional services as well. Whenever possible, keep a copy of your site or software in a different region, ready to take over in an emergency and reduce the impact of a catastrophic event.
For example, SilverTech’s platinum hosting plan includes an active standby server in a completely different region. The globally load-balanced Azure Front Door infrastructure constantly validates the health of the primary region and automatically fails over to the secondary if availability is lost. Should a region fail due to a disaster or a bad update (staggered updates 😊), you are covered.
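To make the failover idea concrete, here is a minimal Python sketch of health-probe-based routing. It does not use or reproduce the Azure Front Door API; `FailoverRouter`, `failure_threshold` and the region names are assumptions chosen purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class FailoverRouter:
    """Routes to the primary region until it fails several probes in a row."""

    primary: str
    secondary: str
    failure_threshold: int = 3      # assumed value; tune for your environment
    consecutive_failures: int = 0

    def record_probe(self, primary_ok: bool) -> None:
        """Record the result of one health probe against the primary region."""
        if primary_ok:
            self.consecutive_failures = 0  # any success resets the counter
        else:
            self.consecutive_failures += 1

    def active_region(self) -> str:
        """Return the region that should currently receive traffic."""
        if self.consecutive_failures >= self.failure_threshold:
            return self.secondary
        return self.primary


if __name__ == "__main__":
    router = FailoverRouter(primary="east", secondary="west", failure_threshold=2)
    for probe_ok in [True, False, False, True]:
        router.record_probe(probe_ok)
        print(probe_ok, "->", router.active_region())
```

Requiring several consecutive failures before switching avoids flapping between regions on a single dropped probe; a managed load balancer applies its own, more sophisticated version of this logic.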