14 Oct 2024

Following July’s CrowdStrike outage, this is how we can avoid the next ‘Blue Friday’


On 19 July, the world experienced ‘Blue Friday’ when 8.5 million PC screens worldwide were hit by the infamous Blue Screen of Death (BSOD), a critical error message displayed by Microsoft Windows when the system encounters a serious issue that it cannot recover from.

This unprecedented outage was triggered by a faulty update from CrowdStrike, a leading cybersecurity provider. The disruption affected government services, emergency operations, transport, payment systems and financial markets across the world.

While not a cyberattack, its scale was historic and raised two major concerns. First, the failure stemmed from a CrowdStrike update that was meant to protect networks. Second, fixing the problem required manual intervention – rebooting each computer into Windows’ Safe Mode and applying a patch, a very time-consuming process when you’re talking about millions of devices.

Some have since argued that new regulation is needed to ensure such a shock doesn’t happen again. In the EU, it isn’t. We already have what we need in the form of two strong pieces of legislation, the Network and Information Systems 2 Directive (NIS2) and the Cyber Resilience Act (CRA). They just need to be implemented properly, combined with better preparedness and a concerted effort to expand the number of trusted security software providers.

How did it all happen?

CrowdStrike isn’t a novice. It’s a trusted cybersecurity company, serving over half of the Fortune 500, with a track record in investigating major incidents such as the 2014 Sony Pictures hack and the 2016 breach of the Democratic National Committee’s email systems.

Software like the update that caused ‘Blue Friday’ can run on Windows in one of two modes – either in a restricted ‘user space’, where applications have only limited access to hardware and system resources, or in ‘kernel space’, where access is unrestricted. In user space, a failure causes only the individual application to shut down, not the whole system, while in kernel space, a software failure can crash the entire system.
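The containment that user space offers can be illustrated with a loose analogy. The sketch below is not Windows kernel code and not how CrowdStrike’s software works; it simply runs a hypothetical ‘update’ in a separate process (the names `run_isolated`, `FAULTY_UPDATE` and `HEALTHY_UPDATE` are invented for illustration) to show that a failure inside an isolated process leaves the host running.

```python
import subprocess
import sys


def run_isolated(code: str) -> int:
    """Run a snippet in a separate interpreter process and return its exit code."""
    result = subprocess.run([sys.executable, "-c", code], capture_output=True)
    return result.returncode


# Hypothetical updates: one fails with an unhandled error, one succeeds.
FAULTY_UPDATE = "raise RuntimeError('bad update')"
HEALTHY_UPDATE = "print('update applied')"

faulty_exit = run_isolated(FAULTY_UPDATE)    # non-zero: the child process died
healthy_exit = run_isolated(HEALTHY_UPDATE)  # zero: the child finished normally

# This line is only reached because the failure stayed inside the child process.
host_still_running = True
```

Code running in kernel space has no such boundary around it, which is why the same kind of fault there can bring down the whole machine.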

CrowdStrike updates its security software automatically and silently at the kernel level so that updates take effect as quickly as possible. According to some, CrowdStrike isn’t the only security company to have caused Windows crashes – what made the July incident historic was its sheer scale.

Industry implications

The CrowdStrike incident highlights the systemic fragility of the internet’s core infrastructure and points to two specific critical issues – our over-reliance on a small pool of providers and the risks associated with automated software updates.

Unlike industries where consumers enjoy variety, the software industry converges towards uniformity – most individual and business users, thanks to strong network effects, prefer using the same operating systems, like Microsoft Windows. The benefits of this include easier management, larger support networks and more compatible products.

However, such ‘software monoculture’ can lead to cascading disruptions when something goes wrong. In short, the more systems that rely on the same software and cybersecurity solutions, the greater the potential impact of any serious malfunction.

In 2009, Microsoft, after consulting with EU regulators, voluntarily agreed to open its Windows Vista operating system by building kernel-level application programming interfaces (APIs), thus allowing third-party security providers to access parts of its operating systems to offer security products and services. Consequently, this generated a vast market for ‘endpoint security’, valued today at around USD 15 billion with several major competing companies, including CrowdStrike.

Specifically, CrowdStrike provides the Falcon system, which protects computers running Microsoft Windows against attacks. A faulty update to Falcon, which operates at the kernel level, caused the 19 July incident, underscoring the risks of concentrating critical services in the hands of just a few cybersecurity companies.

The incident also highlighted how many organisations were completely unprepared to handle large-scale outages. Defensive programming, contingency plans and backup systems are all critical – and Blue Friday clearly showed just how much all of this was lacking.
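As a rough illustration of what defensive programming can mean for updates, the sketch below validates an incoming update before using it and keeps the last known-good configuration if validation fails. The rule format, the `EXPECTED_FIELDS` constant and the function names are assumptions made up for this example; they do not describe Falcon’s actual update format.

```python
EXPECTED_FIELDS = 3  # hypothetical rule format: name, pattern, action


def parse_rule(line):
    """Parse one comma-separated rule, rejecting anything malformed."""
    fields = [field.strip() for field in line.split(",")]
    if len(fields) != EXPECTED_FIELDS or not all(fields):
        raise ValueError(f"malformed rule: {line!r}")
    return tuple(fields)


def apply_update(current_rules, update_text):
    """Return (rules, applied); fall back to current_rules on any bad input."""
    try:
        new_rules = [parse_rule(line)
                     for line in update_text.splitlines() if line.strip()]
    except ValueError:
        return current_rules, False  # keep the last known-good rules
    if not new_rules:
        return current_rules, False  # an empty update is treated as invalid
    return new_rules, True


known_good = [("block-malware", "*.exe", "quarantine")]
good_update = "block-malware,*.exe,quarantine\nflag-script,*.ps1,alert"
bad_update = "block-malware,*.exe,quarantine\nflag-script,*.ps1"  # missing field

rules_after_good, applied_good = apply_update(known_good, good_update)
rules_after_bad, applied_bad = apply_update(known_good, bad_update)
```

The design choice here is that a malformed update is rejected as a whole, so the system keeps running on its previous rules rather than failing on partially valid input.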

As CrowdStrike’s faulty update was done automatically, software users may now demand that this operating model be changed. But this issue is more complex and leads to the more general question over how to properly balance interoperability and cybersecurity.

Some experts claim that mandating operating system interoperability with security vendors’ products, while increasing competition among a relatively small group of providers, has negatively impacted security. Thus, how APIs are designed – especially how resilient and interoperable they are – becomes a critical consideration. This is why governments and corporations need to rethink their reliance on the small number of providers and consider diversifying their software solutions to avoid similar future incidents.

We don’t need any new EU regulations

‘Blue Friday’ also spurred debate among policymakers on how best to respond to it, with the US administration considering new measures on cybersecurity software resilience.

But we don’t need any new regulations here in the EU – we already have two powerful pieces of legislation that, if properly implemented, could help to prevent and mitigate any future systemic outage, namely the NIS2 and the CRA.

The NIS2 envisages a crisis management framework and a series of cybersecurity risk management measures, while the CRA, focused on products with digital elements, contains a set of requirements to make software more resilient.

When it comes to software updates, the CRA requires automatic updates for products that don’t present specific cybersecurity risks, but not for critical products, i.e. those that do present such risks. In the legislation’s implementation phase, it should be specified that, by default, companies using critical products decide for themselves whether an update is installed automatically (an opt-in rather than an opt-out). This would give companies more control over their software and systems when using critical products.
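One way to picture the opt-in default proposed here is as a simple update policy. The sketch below is only an interpretation of the proposal, not the CRA’s text or any vendor’s implementation; `Product`, `UpdateMode`, `default_update_mode` and `should_install` are hypothetical names.

```python
from dataclasses import dataclass
from enum import Enum


class UpdateMode(Enum):
    AUTOMATIC = "automatic"
    MANUAL_APPROVAL = "manual_approval"


@dataclass(frozen=True)
class Product:
    name: str
    critical: bool  # True if the product presents specific cybersecurity risks


def default_update_mode(product):
    """Critical products default to manual approval; others update automatically."""
    return UpdateMode.MANUAL_APPROVAL if product.critical else UpdateMode.AUTOMATIC


def should_install(product, approved_by_operator, override=None):
    """Decide whether an update is installed, honouring an explicit opt-in."""
    mode = override if override is not None else default_update_mode(product)
    if mode is UpdateMode.AUTOMATIC:
        return True
    return approved_by_operator


media_player = Product("media player", critical=False)
endpoint_agent = Product("endpoint agent", critical=True)

installs_media = should_install(media_player, approved_by_operator=False)
installs_agent = should_install(endpoint_agent, approved_by_operator=False)
installs_agent_opt_in = should_install(
    endpoint_agent, approved_by_operator=False, override=UpdateMode.AUTOMATIC
)
```

Under this reading, a company using a critical product receives nothing automatically unless it either approves the update or explicitly opts in to automatic updates.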

Will we learn from Blue Friday?

The CrowdStrike outage wasn’t a cyberattack. But it did reveal the fragility of our dependence on certain software and a small number of companies whose job is to protect said software.

To prevent and mitigate any future outage, we need to increase the number of providers and encourage companies and organisations to better prepare themselves to handle and respond to systemic outages. They can do this using recovery mechanisms, defensive programming, contingency plans and redundancy solutions.

And again, crucially, we absolutely don’t need any new legislation in the EU – instead, the rapid and effective implementation of the NIS2 and the CRA should be enough to help mitigate the fallout from any future outage.