
Posts

Podcast: The Therac-25, buggy software that killed.

As part of an ongoing project to learn more about what we get wrong so we can improve, I look at the Therac-25 incidents, a devastating collection of software failures that often rank in the top 10 of civilian radiation accidents. The Therac-25 radiation therapy device killed or injured 6 people across Canada and the United States. The Therac-25 was a room-sized machine; in the cut-away image, you can see the computer terminal in the near-bottom left. I look into two of the most severe bugs, why the manufacturer didn't fix them, and what we can learn from their mistakes. The MP3 (Audio) file is available here.

Podcast: The Post Office Horizon Scandal

In this episode, we look at the Post Office Horizon scandal, an app that caused what some people are describing as the largest miscarriage of justice in British legal history. We look at some bugs, the legal judgements and what might have gone wrong at the Post Office to allow things to go so off track. I analyse what we can learn from the disaster to help stop this from happening in our own projects. The MP3 (Audio) file is available here.

Podcast: Voting Machine Fail

We wind the clock back to November 2019 and investigate the failure of voting machines in Northampton County, Pa., USA. We break down what went wrong, what caused the problem and what we can learn about the risks of software development from this high-profile incident. The show notes and transcripts are available for free.

Avoiding Wild Goose Chases While Debugging.

When I’m debugging a complex system, I’m constantly looking for patterns. I just ran this test code... What did I see in the log? I just processed a metric $^&*-load of data, did our memory footprint blip? I’m probably using every freedom-unit of screen space to tail logs, run a memory usage tool, run an IDE & debugger, watch a trace of API calls, run test code… And I’m doing this over and over. Then I see it. Bingo: that spike in API calls happens only when that process over there jumps to 20% processor usage and the app also throws that error. Unfortunately, I may have been mistaken. On a sufficiently complex system, the emergent behaviour can approach the appearance of randomness. Combinatorial explosions are for real, and they are happening constantly in your shiny new MacBook. My bug isn’t what I think it is. I’m examining so many variables in a system with dozens of subsystems at play that it's inevitable I will see a correlation. We know this more formally as...
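(A quick illustration of that last point, not from the original post.) With enough unrelated metrics in view, chance alone will hand you a convincing-looking pattern. A few lines of Python, using only the standard library, show how often pure noise "correlates":

```python
import random
import statistics

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

random.seed(1)
n_metrics, n_samples = 30, 15   # 30 unrelated "metrics", 15 observations of each
metrics = [[random.gauss(0, 1) for _ in range(n_samples)] for _ in range(n_metrics)]

# 30 metrics give 435 distinct pairs to compare; even though every series is
# pure noise, chance alone makes some pairs look convincingly correlated.
spurious = []
for i in range(n_metrics):
    for j in range(i + 1, n_metrics):
        r = pearson(metrics[i], metrics[j])
        if abs(r) > 0.5:
            spurious.append((i, j, round(r, 2)))

print(len(spurious), "'strong' correlations found in pure noise")
print(spurious[:5])
```

The more dashboards, traces and counters you stare at simultaneously, the more of these phantom relationships you will spot.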

Avoiding Death By Exposure

There's no such thing as a small bug. Customers, be they people or businesses, do not measure software bugs in metres, feet, miles or kilograms. They use measures like time wasted, lives lost and money. Take a recent bug from Facebook. It affected thousands, maybe millions of customers and the bottom line of companies (seemingly) unconnected with Facebook such as Spotify, TikTok and SoundCloud, and probably countless smaller companies. So why did the journalist seem to think it was small? Too often we judge the systems we create by how likely they are to fail, given our narrow view of the world. A better measure is our exposure when the systems fail. The exposure for Facebook is that other companies now have a greater motivation to disentangle themselves from Facebook's SDK, or to promote a rival platform. It doesn't matter if our bug is one tiny assumption or one character out of place; if it stops a million people from using or buying an app, then it's a huge bug.  ...

Convexity in Predictive Value & Why Your Tests Are Flaky.

A long time ago, in a country far away, a cunning politician suggested a way to reduce crime. He stated that a simple test could be used to catch all the criminals. When tested, all the criminals would fail the test and be locked up. There’d be no need for expensive courts, crooked lawyers or long, drawn-out trials. The politician failed to give details of the test when pressed by journalists, stating that the test was very sensitive and they wouldn’t understand it. His supporters soon had their way and the politician was elected to office. On his first day in office, he deployed his national program of criminality-testing. Inevitably, the details of the test leaked out. The test was simple and was indeed capable of ensuring 100% of criminals were detected. The test was: if the person is alive, find them guilty and lock them up. The test had a sensitivity of 100%; every single actual... real... bona fide criminal would fail the test and find themselves in prison. Unfo...
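To put the punchline in the usual testing terms, here is a quick sketch with purely illustrative numbers (my own, not from the post): sensitivity only measures how many real positives a test catches, and says nothing about how many innocent people get swept up alongside them.

```python
# Illustrative numbers, not from the post: 1,000,000 people, 1,000 real criminals.
population = 1_000_000
criminals = 1_000
innocents = population - criminals

# The politician's test: "if the person is alive, find them guilty".
true_positives = criminals        # every real criminal is flagged
false_negatives = 0               # nobody slips through
false_positives = innocents       # every innocent person is flagged too
true_negatives = 0                # nobody is ever cleared

sensitivity = true_positives / (true_positives + false_negatives)
specificity = true_negatives / (true_negatives + false_positives)
ppv = true_positives / (true_positives + false_positives)  # positive predictive value

print(f"sensitivity: {sensitivity:.0%}")        # 100%, the headline figure
print(f"specificity: {specificity:.0%}")        # 0%, the part left unsaid
print(f"positive predictive value: {ppv:.1%}")  # 0.1% of the 'guilty' are actually guilty
```

A test can be perfectly sensitive and still be worthless if almost everything it flags is a false alarm; the same trade-off is at work in flaky test suites.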

Fire Tower Tests & The GRIM Test.

Sometimes finding out why something is broken is a long and painful process. You might have to trawl through a tonne of data, logs or equipment, filtering out what looks OK from what looks suspect. These laborious investigative tasks are in the back of our minds when we’re asked to do a code review, test some new code or check a pull request. What people often forget is that finding out if something is broken is completely different from finding out why it's broken. Being quick and efficient at finding broken stuff often takes a different approach to the task of clarifying the causes. A failure to test efficiently is therefore often a failure in imagination, a difficulty in creating these new techniques. (Image: Kettlefoot Fire Lookout Tower atop Doe Mountain, Johnson County, Tennessee.) Take fire towers, for example: large forests in places like Canada and the US used to have large networks of fire towers. These steel structures literally towered over the neighbouring landscape...
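The GRIM test in the title is Brown and Heathers' granularity check: a mean calculated from n whole-number observations can only take certain values, so some reported means are arithmetically impossible. A minimal sketch of the idea (my own illustration, not code from the post):

```python
def grim_consistent(reported_mean: float, n: int, decimals: int = 2) -> bool:
    """GRIM check: could a mean of n integer-valued items, rounded to
    `decimals` places, actually equal reported_mean?"""
    # The underlying total must be a whole number, so test the integer
    # totals on either side of reported_mean * n and see whether either
    # one rounds back to the reported figure.
    approx_total = int(reported_mean * n)
    for total in (approx_total, approx_total + 1):
        if round(total / n, decimals) == round(reported_mean, decimals):
            return True
    return False

# A mean of 5.19 from 28 whole-number responses is arithmetically impossible:
print(grim_consistent(5.19, 28))   # False, so the reported figure cannot be right
print(grim_consistent(5.18, 28))   # True: 145 / 28 = 5.1785..., which rounds to 5.18
```

Like a fire tower, it doesn't tell you why the figure is wrong, only that smoke is rising and a closer look is needed.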