
What We Can Learn From the CrowdStrike Fiasco

Originally published at Mind Matters

The CrowdStrike platform is a piece of cybersecurity software deployed to millions of computers worldwide. While it supports several operating systems, it is used primarily on Windows machines.

What happened and why

On July 19, 2024, CrowdStrike pushed a faulty update to Windows computers running its software, causing them to crash and fail to boot. The failure disrupted many sectors, with the worst impact on airlines, banks, and healthcare. CrowdStrike quickly corrected the problem on its end, but the damage was done: each affected computer required manual intervention to get working again.

This obviously points to problematic internal controls at CrowdStrike. First, any update should be tested internally before deployment. Second, deployments at this scale should be done in a rolling fashion, with feedback mechanisms that can halt a bad update before it becomes a system-wide catastrophe such as this one (a sketch of that idea follows below).
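
To make the rolling-deployment idea concrete, here is a minimal sketch of a staged ("canary") rollout with a health-check gate. It is an illustration only: the function names (deploy_to, healthy_fraction), stages, and threshold are invented for the example and are not CrowdStrike's actual tooling.

```python
import random  # stand-in for real telemetry

STAGES = [0.001, 0.01, 0.10, 0.50, 1.0]  # fraction of the fleet per stage
HEALTH_THRESHOLD = 0.99                  # abort if fewer hosts stay healthy

def deploy_to(fraction: float) -> None:
    """Push the update out to this fraction of the fleet (hypothetical)."""
    print(f"deploying to {fraction:.1%} of hosts")

def healthy_fraction() -> float:
    """Share of updated hosts still reporting healthy (placeholder)."""
    return random.uniform(0.97, 1.0)  # real systems would query monitoring

def staged_rollout() -> bool:
    for stage in STAGES:
        deploy_to(stage)
        if healthy_fraction() < HEALTH_THRESHOLD:
            # A failure at the 0.1% stage breaks thousands of machines,
            # not millions, and the rollout stops before going wider.
            print(f"health check failed at {stage:.1%}; halting rollout")
            return False
    return True

if __name__ == "__main__":
    staged_rollout()
```

The design point is the feedback loop: each stage must prove itself on live machines before the next, larger stage proceeds, so a defect that escapes internal testing is contained early.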

However, I want to look more broadly at this problem and at what it might teach us about how we approach technological solutions generally.

The risk of removing all risk

In modern society, we oftentimes try to remove risk altogether. We buy insurance for everything under the sun, we save for retirement, and we have lots of rules to make sure nobody gets hurt. The question, though, is whether we are actually removing risk or just moving it somewhere else. Nassim Nicholas Taleb, investor and author of a number of books on business, finance, and investing, has been warning for years that many of the things we do to remove risk actually make the problem worse while also making it less visible.

When we think about risk, we normally picture “normal” distributions, where risk is spread fairly evenly across the spectrum: the more catastrophic an event, the less likely it is to occur. Our expectation is that, if we remove risk from ordinary things, we are also removing risk at the extremes. In fact, the opposite is often true. When we remove risk from ordinary things, we often add risk at the extreme ends. This creates distributions with what are known as “fat tails”: an increased probability that extreme events will happen.
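
To put a number on the “fat tails” idea, here is a small illustration of my own (not drawn from Taleb directly): compare the probability of an observation more than five units above the mean under a thin-tailed normal distribution versus a Student's t distribution with 3 degrees of freedom, a standard stand-in for fat-tailed behavior.

```python
from scipy.stats import norm, t

# Probability of an observation more than 5 units above the mean.
p_normal = norm.sf(5)   # thin-tailed normal: about 2.9e-7
p_fat = t.sf(5, df=3)   # fat-tailed Student's t(3): about 7.7e-3

print(f"P(X > 5), normal distribution: {p_normal:.2e}")
print(f"P(X > 5), fat-tailed t(3):     {p_fat:.2e}")
print(f"Ratio: ~{p_fat / p_normal:,.0f}x more likely under fat tails")
```

The extreme event is roughly 27,000 times more likely under the fat-tailed distribution, even though the two distributions look similar near the center. That is exactly the trap: day-to-day experience under both looks the same.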

If everybody buys insurance, what happens when the insurance company goes broke? If everybody “plays it safe” in ordinary life, what happens when extreme acts of heroism are required and nobody is up to the task? Oftentimes, what we gain by de-risking the short and medium term shows up as fat tails, making extreme failures more likely. We can convince ourselves that they won't happen because they don't happen often, and then, when they do occur, we act as if they were among those things nobody can control.

The tail risk

Our society is laser-focused on the near-term, first-order effects of actions and almost entirely blind to their larger-scale, second-order effects. In the case of CrowdStrike, companies de-risk their day-to-day security operations by handing them over to a third-party firm. Additionally, auditors take this to be a positive thing, oftentimes bypassing large swaths of questions simply because a company has put its computers under CrowdStrike's control. What they are missing is the tail risk that this arrangement adds.

In this case, it was a faulty update. But there are other tail risks to consider. What happens if a bad actor gains a prominent position at CrowdStrike (or a similar firm)? What happens if someone discovers a vulnerability in CrowdStrike itself that makes the computers running it less safe?

In short, many managers have spent their time considering only near-term risks. It is time for IT managers to also consider the “tail risks” of their decisions. We need to make tail risks an ordinary part of our vocabulary and considerations. Maybe the amount of effort saved in the short run makes disasters like this worthwhile, and perhaps we should simply plan for them. But that should be a deliberate decision, made with full knowledge of how near-term decisions affect long-term tail risks (a back-of-envelope version of that calculation follows below).
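
As a hedged illustration of what such a deliberate decision might look like numerically, here is a back-of-envelope expected-cost comparison. Every figure below is invented for the example; the point is only that the tail term belongs in the calculation at all.

```python
# Hypothetical yearly figures, invented for illustration only.
IN_HOUSE_COST = 2_000_000       # running security operations internally
OUTSOURCED_COST = 500_000       # subscribing to a third-party platform
P_CATASTROPHE = 0.01            # assumed yearly chance of a fleet-wide failure
CATASTROPHE_COST = 100_000_000  # assumed cost of a CrowdStrike-scale outage

# Expected yearly cost of outsourcing once the tail risk is priced in.
expected_outsourced = OUTSOURCED_COST + P_CATASTROPHE * CATASTROPHE_COST

print(f"In-house:   ${IN_HOUSE_COST:,}/yr")
print(f"Outsourced: ${expected_outsourced:,.0f}/yr expected "
      f"(${OUTSOURCED_COST:,} + {P_CATASTROPHE:.0%} x ${CATASTROPHE_COST:,})")
```

With these made-up numbers, outsourcing still wins even after the tail term is included; with a larger outage cost or probability, it would not. Either way, the trade-off is now explicit rather than invisible.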

Jonathan Bartlett

Senior Fellow, Walter Bradley Center for Natural & Artificial Intelligence
Jonathan Bartlett is a senior software engineer at McElroy Manufacturing. Jonathan is part of McElroy's Digital Products group, where he writes software for their next generation of tablet-controlled machines as well as develops mobile applications for customers to better manage their equipment and jobs. He also offers his time as the Director of The Blyth Institute, focusing on the interplay between mathematics, philosophy, engineering, and science. Jonathan is the author of several textbooks and edited volumes which have been used by universities as diverse as Princeton and DeVry.