In Brief: Computing Chaos
When CrowdStrike sneezes, the entire business world catches a cold.
On July 19, 2024, millions of people across the globe had their day suddenly turned upside down by a minor programming error. Travelers were left stranded at airports as thousands of flights were canceled or delayed. Shoppers found that some stores could no longer process credit-card purchases. And all of it traced back to a single faulty update pushed out by CrowdStrike.
While the problem was quickly resolved, the conditions that produced it have not gone away. “This sort of nations-spanning simultaneous failure would have been utterly impossible as recently as a decade ago,” observes columnist Jack Baruth, “but it’s highly likely to happen with increasing frequency from now on, thanks to a combination of individually benign, but collectively deadly, changes in our global technology infrastructure.”
The first of these problems, and the proximate cause behind the CrowdStrike outage, has to do with the way software is written in 2024. Historically, computer programs were the work of small, dedicated teams that understood their products from nose to tail. Often, a single person did the bulk of the work, as was the case for both the popular 1982 home video game River Raid, written by Activision employee Carol Shaw, and the powerful UNIX operating system, initially created by AT&T’s Ken Thompson. Software written in this fashion tended to be effective, efficient, and largely bug-free, which was important in an era without the possibility of remote software updates. It was also remarkably difficult to predict when it might be finished. The average pre-internet computer programming project was kind of like the later Steely Dan records: just a few enigmatic people running the show, with no accountability to management and little incentive to follow anything other than their own whims along the way.
But times have changed.
Most of today’s software is developed and released in two-week “sprint” intervals by teams of anonymous, interchangeable, low-skill hired-gun programmers, most of whom are sourced from overseas on a lowest-bidder basis. Each of them is given a tiny piece of the overall task to work on. Rarely do they possess, or even want, a greater understanding of how their contributions fit into the program as a whole. When the work of two adjacent coders conflicts, the conflict is resolved automatically by the tools they use, and not always correctly.
Baruth asserts that this offshoring of programming has created a business culture where programmers are considered “expendable commodities” while “onshore managers are irreplaceable assets who portray themselves as masters of ‘agile’ or ‘scrum’ methods to anonymize and dehumanize the people doing the actual work.”
Consequently, this arrangement is all but irresistible to American tech leaders, even if the promised cost savings from offshore code farms never materialize and even if the resulting product is subpar. Which it almost always is nowadays, in ways ranging from “this new phone is slower than my old one, even though it’s more powerful,” to “this airplane seems to fall out of the sky more often than we’d like.”
Of course, even the most incompetent software can’t hurt you if it isn’t installed on your computer, or if you have a chance to evaluate it on test systems before installing it. In the past, most major systems were operated by skilled personnel who had the last word on what went on “their” computers. It was common to test software patches or updates on a few systems before releasing them to the company as a whole. This didn’t happen with the CrowdStrike update because the Falcon program, which is supposed to protect computers against criminal hacking and external attacks, has authority that supersedes that of the system administrators. It could install its own updates from CrowdStrike at any time, without the consent of the computer owner. Which it did, pretty much everywhere all at once. Then the dominoes started to fall.
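The safeguard described above, trying an update on a handful of machines before the whole company gets it, is what the industry calls a staged or “canary” rollout. The sketch below is only an illustration of that idea, written in Python with made-up host names, stage groupings, and a placeholder health check; it is not how CrowdStrike, Windows, or any real deployment tool actually works.

```python
# A minimal sketch of a staged ("canary") rollout gate, using a
# hypothetical fleet and health check. It does not reflect CrowdStrike's
# or any vendor's actual update pipeline.

from dataclasses import dataclass


@dataclass
class Stage:
    name: str
    hosts: list[str]


def host_is_healthy(host: str) -> bool:
    """Stand-in health probe. A real one might confirm the machine
    rebooted cleanly and its critical services are still responding."""
    return not host.endswith("-bad")


def staged_rollout(update_id: str, stages: list[Stage]) -> bool:
    """Push an update one stage at a time, halting at the first failure."""
    for stage in stages:
        print(f"Deploying {update_id} to '{stage.name}' ({len(stage.hosts)} hosts)")
        unhealthy = [h for h in stage.hosts if not host_is_healthy(h)]
        if unhealthy:
            print(f"Halting rollout: unhealthy hosts in '{stage.name}': {unhealthy}")
            return False
    print(f"{update_id} reached the full fleet")
    return True


if __name__ == "__main__":
    fleet = [
        Stage("canary", ["test-01", "test-02"]),
        Stage("one department", ["acct-01", "acct-02", "acct-03"]),
        Stage("company-wide", [f"prod-{i:02d}" for i in range(1, 21)]),
    ]
    staged_rollout("update-001", fleet)
```

The point of the pattern is simply that a defective update halts on a few test machines instead of on millions of production ones. No such gate applied here, because the choice of when to update wasn’t the customers’ to make.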
The problem is software companies having “absolute power” over their software, including when and how updates are pushed. It’s the tech version of the “we know what’s best for you” culture.
We are now dangerously close to a “monoculture” in many aspects of tech. Amazon runs more of the world’s cloud servers than any other company, so when an outage strikes, as one did in the “US-East” region of Amazon Web Services on Dec. 7, 2021, the effects are immediate and far-reaching. The combination of Windows Server and CrowdStrike Falcon is in use at more than half of the Fortune 500 companies, so when CrowdStrike sneezes, the whole business world catches a cold.
How did we get to this monoculture? Some of you may remember the old phrase “Nobody ever got fired for buying IBM.” The famously anticompetitive tech sector has used a series of technical partnerships and deliberate incompatibilities to extend this mindset to nearly every level of software and computing. CrowdStrike is an Amazon Web Services partner, a Dell partner, a Netskope partner, and so on. When you buy one product in the stack, you’re encouraged to buy the others as well, so most tech leaders simply do the easiest thing.
Baruth concludes:
Yes, this outage was CrowdStrike’s fault. That’s like saying that the Challenger disaster was an O-ring problem. It doesn’t convey the broken nature of the system that let it happen. In this case, the lessons should be clear to every tech leader in America. Most of them won’t bother to learn those lessons or even take the smallest steps to prevent the next problem. After all, this outage is now handled. It’s history. There’s just one little problem: It’s the kind of history that is all but certain to repeat, again and again.