What We Can Learn From the CrowdStrike Fiasco
Originally published at Mind Matters

The CrowdStrike platform is cybersecurity software that has been deployed to millions of computers worldwide. While it supports several operating systems, it is primarily used on Windows machines.
What happened and why
On July 19, an update was pushed to Windows computers running CrowdStrike that caused them to crash and fail to boot. This disrupted many different sectors, but the worst impact was on airlines, banks, and healthcare. CrowdStrike fixed the faulty update quickly, but the damage was done: because affected machines could no longer boot normally, each one required manual intervention to get working again.
This obviously points to problematic internal controls at CrowdStrike. First, any update should be tested internally before deployment. Second, deployments at this scale should be done in a rolling fashion, with feedback mechanisms that halt the rollout before a system-wide catastrophe such as this one.
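To make that second point concrete, here is a minimal sketch of a staged (“canary”) rollout in which each wave’s failure rate gates the next. Everything in it is hypothetical: the fleet, the failure rate, and the thresholds are invented for illustration, and nothing here reflects CrowdStrike’s actual pipeline.

```python
# A minimal sketch of a staged ("canary") rollout with a feedback gate.
# Hypothetical throughout: the fleet, the failure rate, and the thresholds
# are invented for illustration, not drawn from CrowdStrike's pipeline.

import random

FLEET = [f"host-{i}" for i in range(10_000)]  # simulated fleet
STAGES = [0.001, 0.01, 0.1, 1.0]              # cumulative fraction per wave
FAILURE_THRESHOLD = 0.02                      # abort if >2% of a wave fails

def push_update(host: str) -> bool:
    """Pretend to install the update and report whether the host stays
    healthy. Here we simulate a bad update that crashes ~30% of hosts."""
    return random.random() > 0.30

def staged_rollout() -> None:
    deployed = 0
    for stage, fraction in enumerate(STAGES, start=1):
        wave = FLEET[deployed:int(len(FLEET) * fraction)]
        failures = sum(not push_update(host) for host in wave)
        deployed += len(wave)
        rate = failures / len(wave)
        print(f"stage {stage}: {len(wave)} hosts, failure rate {rate:.1%}")
        if rate > FAILURE_THRESHOLD:
            print("threshold exceeded: halting rollout, rolling back wave")
            return
    print("rollout completed across the full fleet")

if __name__ == "__main__":
    staged_rollout()
```

Even a crude gate like this turns a fleet-wide catastrophe into a contained incident: a bad update that crashes 30% of machines is caught after the first small wave instead of reaching millions of computers.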
However, I want to look more broadly at this problem and at what it might teach us about how we evaluate technological solutions generally.
The risk of removing all risk
In modern society, we oftentimes try to remove risk altogether. We buy insurance for everything under the sun, we save for retirement, and we have lots of rules to make sure nobody gets hurt. The question, though, is whether we are actually removing risk or just moving it somewhere else. Nassim Nicholas Taleb, investor and author of a number of books on business, finance, and investing, has been warning for years that many of the things we do to remove risk actually make the problem worse while also making it less visible.
When we think about risk, we normally picture a “normal” distribution, where the probability of an event falls off quickly as its severity grows: the more catastrophic the event, the less likely it is to occur. Our expectation is that if we remove risk from the ordinary things, we are also removing risk at the extremes. In fact, the opposite is often true. When we remove risk from the ordinary things, we often add risk at the extreme ends. This produces distributions with what are known as “fat tails”: a much higher probability of extreme events than a normal distribution would predict.
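To see how large that difference can be, here is a short calculation (a sketch using scipy; the Student-t distribution is a standard stand-in for a generic fat-tailed distribution, rescaled to the same variance as the normal) comparing the odds of a five-sigma event under thin and fat tails:

```python
# How much more likely is a 5-sigma event under fat tails? A Student-t
# distribution with 3 degrees of freedom stands in for "fat-tailed,"
# rescaled so both distributions have the same variance. Requires scipy.

import numpy as np
from scipy import stats

SIGMA = 5
DF = 3  # degrees of freedom; variance of a standard t is df / (df - 2)

p_normal = 2 * stats.norm.sf(SIGMA)                         # ~5.7e-07
p_fat = 2 * stats.t.sf(SIGMA * np.sqrt(DF / (DF - 2)), DF)  # ~3.3e-03

print(f"normal:     P(|X| > {SIGMA} sigma) = {p_normal:.1e}")
print(f"fat-tailed: P(|X| > {SIGMA} sigma) = {p_fat:.1e}")
print(f"extreme events are ~{p_fat / p_normal:,.0f}x more likely")
```

The normal distribution puts roughly six in ten million of its mass beyond five sigma; the fat-tailed one puts roughly three in a thousand there. Same average behavior, but the extreme event is thousands of times more likely.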
If everybody buys insurance, what happens when the insurance company goes broke? If everybody “plays it safe” in ordinary life, what happens when extreme acts of heroism are required and nobody is up to the task? Oftentimes, what we gain by de-risking the short and medium term shows up as a fat tail: extreme failures become more likely. We can convince ourselves that they won’t happen because they don’t happen often, and then, when they do occur, act as if they were one of those things nobody could control.
The tail risk
Our society is laser-focused on the near-term, first-order effects of actions and almost entirely blind to their larger-scale, second-order effects. In the case of CrowdStrike, companies de-risk their day-to-day security operations by handing them over to a third party. Additionally, auditors treat this as a positive, oftentimes bypassing large swaths of questions simply because a company has put its computers under CrowdStrike’s control. What they are missing is the tail risk this adds.
In this case, it was a faulty update, but there are other tail risks to consider. What happens if a bad actor gains a privileged position at CrowdStrike (or a similar firm)? What happens if someone discovers a vulnerability in CrowdStrike itself that makes the computers running it less safe rather than more?
In short, many managers have spent their time considering only near-term risks. It is time for IT managers to also consider the “tail risks” of their decisions. We need to make tail risk an ordinary part of our vocabulary and our deliberations. Perhaps the effort saved in the short run makes disasters like this one worthwhile, and we should simply plan for them. But that should be a deliberate decision, made with full knowledge of how near-term choices affect long-term tail risks.
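As a back-of-the-envelope illustration of what that deliberation might look like (every number below is invented for the example), consider two strategies with the same expected annual loss but very different tails:

```python
# A back-of-the-envelope comparison (every number here is invented).
# Two strategies with identical expected annual loss can carry very
# different tail risk, which is exactly what a near-term view misses.

in_house = {
    "routine incident": (0.20, 50_000),        # (annual probability, loss $)
    "catastrophe":      (0.0001, 10_000_000),
}
outsourced = {
    "routine incident": (0.02, 50_000),        # fewer routine incidents...
    "catastrophe":      (0.001, 10_000_000),   # ...but a 10x fatter tail
}

def expected_annual_loss(risks: dict) -> float:
    return sum(p * loss for p, loss in risks.values())

for name, risks in [("in-house", in_house), ("outsourced", outsourced)]:
    p_cat = risks["catastrophe"][0]
    print(f"{name:>10}: expected loss ${expected_annual_loss(risks):,.0f}/yr, "
          f"P(catastrophe) = {p_cat:.2%}")
```

On paper the two strategies cost the same per year. The outsourced one has simply traded many small, survivable losses for a tenfold higher chance of a single catastrophic one, and that trade is the thing to decide about deliberately.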