On 19 July 2024, the world witnessed what could easily be described as the largest global IT outage in history, when CrowdStrike, a cyber security solutions provider, shipped a defective security update that impacted an estimated 8.5 million Windows computers worldwide.
Whilst the number of affected computers equates to less than 1% of the estimated installation base of Windows 10 and 11 (circa 1.4 billion installs), news outlets and social media were reporting outages affecting aviation, banking and finance, healthcare, retail, transportation and government services.
The defective security update prevented Windows computers from booting correctly, resulting in the all too common Blue Screen of Death (BSOD).
> Stuck at the airport for a redeye due to a CrowdStrike update which caused a massive outage worldwide….
> — @JTerryy07, July 19, 2024
>
> You get a Blue Screen of Death
> You get a Blue Screen of Death
> You get a Blue Screen of Death
> You get a Blue Screen of Death pic.twitter.com/Oi5O2iynYs
Why did it take so long to resolve?
CrowdStrike were quick to publish an official workaround for fixing the defective security update. In essence, this boiled down to deleting the update file.
Users were advised to:
- Reboot the host normally, to give it a chance to download the reverted update
- If it still crashes, boot Windows into Safe Mode or the Windows Recovery Environment
- Navigate to the %WINDIR%\System32\drivers\CrowdStrike directory
- Locate and delete the file matching “C-00000291*.sys”
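For fleets with some form of scripted access, the deletion step can be automated. Below is a minimal sketch in Python, assuming an interpreter is available in the recovery environment and the script runs with administrative rights; in practice most teams used cmd.exe, PowerShell or Microsoft’s own recovery tooling to do the equivalent. The directory and file pattern come from the advisory above.

```python
# Minimal sketch: remove the defective CrowdStrike channel file.
# Assumes it runs from Safe Mode / WinRE with admin rights and that a Python
# interpreter is available -- in practice most teams used cmd.exe, PowerShell
# or Microsoft's recovery tooling to achieve the same result.
import glob
import os

# Path and file pattern as per the CrowdStrike advisory.
crowdstrike_dir = os.path.expandvars(r"%WINDIR%\System32\drivers\CrowdStrike")
pattern = os.path.join(crowdstrike_dir, "C-00000291*.sys")

for channel_file in glob.glob(pattern):
    print(f"Deleting {channel_file}")
    os.remove(channel_file)
```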
On the surface, these steps amount to a quick fix that should only take 5-10 minutes to apply. However, many impacted businesses use the BitLocker disk encryption functionality built into Windows to protect their data.
If a system is protected with BitLocker, it is not possible to boot into Safe Mode without knowing the BitLocker recovery key. The problem was further amplified by the fact that the recovery key cannot be entered remotely, so for many impacted businesses the fix had to be applied locally on each device.
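One practical mitigation is to escrow BitLocker recovery keys centrally before an incident, so they can be read out to engineers standing at a stuck machine. The sketch below is illustrative only: it assumes Windows’ built-in manage-bde tool, administrative rights, and a hypothetical network share as the escrow location; most organisations escrow keys to Active Directory or Entra ID/Intune instead.

```python
# Minimal sketch: collect BitLocker recovery key information from a healthy
# machine ahead of an incident, so it is available if Safe Mode access is
# needed later. Assumes Windows' built-in manage-bde tool and admin rights;
# the UNC escrow path below is a hypothetical placeholder.
import socket
import subprocess
from pathlib import Path

ESCROW_SHARE = Path(r"\\backup-server\bitlocker-escrow")  # hypothetical location

# manage-bde lists the key protectors, including the numerical recovery password.
result = subprocess.run(
    ["manage-bde", "-protectors", "-get", "C:"],
    capture_output=True, text=True, check=True,
)

out_file = ESCROW_SHARE / f"{socket.gethostname()}.txt"
out_file.write_text(result.stdout)
print(f"Recovery key information escrowed to {out_file}")
```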
How long did it take to fix the issue?
CrowdStrike were able to fix the issue 79 minutes after the root cause was identified. Whilst this prevented the defective update from reaching further Windows installations running CrowdStrike, it did not resolve the problem for machines that had already received it.
For those affected by the defective update, recovery times for businesses ranged from hours to days. Many factors contributed to this variance, including:
- Existing disaster recovery plans
- Ability to configure and manage machines remotely
- Deployment topology – those using virtual desktops were able to roll back to previous backups more rapidly than those using physical workstations
What was the impact?
Insurance services firm Parametrix estimated that the outage impacted a quarter of the Fortune 500 companies, putting their losses in the region of $5.4 billion.
CrowdStrike also suffered losses of its own, with its share price plummeting from $343 to $217.
How can businesses be better prepared for technology outages?
Backup, Redundant Systems and Continuity/Disaster Planning
Outages that affect the availability of systems are often mitigated through redundancy. Businesses can put in place redundant systems, together with business continuity and disaster recovery (DR) plans, so that there is a clear, rehearsed path to fall back on should primary systems fail.
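As a toy illustration of the redundancy idea, the sketch below shows a client that probes a primary endpoint and falls back to a secondary when the primary is unhealthy. The URLs and timeout are illustrative assumptions; a real DR plan also covers data replication, recovery time and recovery point objectives, and regular failover rehearsals.

```python
# Toy sketch of client-side failover between redundant endpoints: try the
# primary service first and fall back to a secondary if it is unhealthy.
# The URLs and timeout are illustrative placeholders.
import urllib.error
import urllib.request

ENDPOINTS = [
    "https://primary.example.internal/health",    # hypothetical primary
    "https://secondary.example.internal/health",  # hypothetical standby
]

def first_healthy_endpoint(timeout_seconds: float = 2.0) -> str | None:
    for url in ENDPOINTS:
        try:
            with urllib.request.urlopen(url, timeout=timeout_seconds) as response:
                if response.status == 200:
                    return url
        except (urllib.error.URLError, OSError):
            continue  # endpoint unreachable or unhealthy, try the next one
    return None

print(first_healthy_endpoint() or "No healthy endpoint -- invoke the DR plan")
```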
Consider the implications of security patching policies
There is a clear argument that CrowdStrike should have tested the update more thoroughly before releasing it to the public. However, it is worth noting that in a cyber security context, a provider can find itself in a challenging situation: thoroughness of testing matters, but so does shipping new protections promptly, so that an emerging threat is covered before adversaries can exploit it.
Redundant digital systems may not be enough for some industries and situations. Businesses need to think carefully about how they apply patches to backup and DR systems – after all, these systems also need to be secure. Propagating a security update immediately to DR and backup systems keeps them current with the latest patches; however, in an incident like this one, such a policy would have taken the DR and backup systems down too. The opposite approach – delaying patches to backup and DR systems – leaves them exposed to known attacks for longer.
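One way to balance these two extremes is a staged (ring-based) rollout, where an update soaks on a small canary group before it reaches broader production and, last of all, DR and backup systems. The sketch below illustrates the idea; the ring names, soak periods and health-check hook are illustrative assumptions, not a description of any particular vendor’s mechanism.

```python
# Minimal sketch of a ring-based (staged) patching policy: updates reach a
# small canary group first and only propagate to later rings -- including DR
# and backup systems -- if the previous ring stays healthy through its soak.
# Ring names, hosts, soak periods and the health check are illustrative.
from dataclasses import dataclass

@dataclass
class Ring:
    name: str
    soak_hours: int   # how long the previous ring must stay healthy first
    hosts: list[str]

ROLLOUT = [
    Ring("canary", soak_hours=0, hosts=["canary-01", "canary-02"]),
    Ring("production", soak_hours=4, hosts=["ws-001", "ws-002", "ws-003"]),
    Ring("dr-and-backup", soak_hours=24, hosts=["dr-01", "backup-01"]),
]

def ring_is_healthy(ring: Ring) -> bool:
    # Placeholder: in practice this would query monitoring/EDR telemetry.
    return True

def roll_out(update_id: str) -> None:
    for previous, ring in zip([None] + ROLLOUT[:-1], ROLLOUT):
        if previous is not None and not ring_is_healthy(previous):
            print(f"Halting rollout of {update_id} before ring '{ring.name}'")
            return
        print(f"Applying {update_id} to ring '{ring.name}' "
              f"after a {ring.soak_hours}h soak: {ring.hosts}")

roll_out("channel-update-example")
```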
There is still value in being able to fallback to manual
Businesses would benefit from establishing redundant manual processes where applicable. For example, many airlines had to fall back to paper-based check-in processes when the CrowdStrike outage struck. Execution, however, was often marred by a lack of practice and training.
Limiting the attack surface
One thing that the CrowdStrike outage highlighted to the general public is how many information boards and other such systems run a fully-fledged Windows operating system behind the scenes.
Many appliance-type devices (such as information boards) don’t warrant a fully-fledged operating system for what they need. However, if a full-fat operating system is being used, it needs to be fully protected with a security tool such as CrowdStrike.
If, however, more of these appliances used either smaller embedded systems or systems that could be fully managed remotely, this would bring a number of benefits:
- Smaller embedded systems have fewer components and less complexity, which means the attack surface is smaller and the security requirements are proportionally simpler
- Systems that can be fully managed remotely – including debugging, troubleshooting and rebooting – can be recovered far more quickly in the event of an incident
Remote troubleshooting and configuration management
Businesses should invest in the ability to remotely configure and manage their technology estate – whether that be workstations, edge devices or other appliances.
Configuration management isn’t a new topic, but the CrowdStrike incident from July demonstrates that many organisations still have room to improve how they implement it.
After all, as soon as something goes wrong, remote troubleshooting and configuration management become critical to resolving issues quickly. If the backbone for this infrastructure doesn’t exist from the outset, it can be complex, time-consuming and expensive to retrofit.
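As a rough illustration of what that backbone enables, the sketch below fans a single troubleshooting command out across an inventory of hosts over SSH (Windows now ships an optional OpenSSH server). The inventory, credentials and command are illustrative placeholders; in practice this role is usually filled by tooling such as an RMM platform, Intune, SCCM or Ansible.

```python
# Rough sketch of remote troubleshooting fan-out: run one diagnostic command
# on every host in an inventory over SSH. Assumes key-based SSH access (e.g.
# Windows' optional OpenSSH server); hosts, username, key path and the command
# itself are illustrative placeholders.
import os
import paramiko

INVENTORY = ["ws-001.example.internal", "ws-002.example.internal"]  # hypothetical
# Reports which hosts still carry the defective channel file.
CHECK_CMD = r'dir "%WINDIR%\System32\drivers\CrowdStrike\C-00000291*.sys"'

def check_host(host: str) -> None:
    client = paramiko.SSHClient()
    client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    client.connect(host, username="remediation",
                   key_filename=os.path.expanduser("~/.ssh/id_ed25519"))
    _, stdout, stderr = client.exec_command(CHECK_CMD)
    print(host, (stdout.read() or stderr.read()).decode().strip())
    client.close()

for host in INVENTORY:
    check_host(host)
```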