
Fire Tower Tests & The GRIM Test.



Sometimes finding out why something is broken is a long and painful process. You might have to trawl through a tonne of data, logs or equipment, separating what looks OK from what looks suspect.

These laborious investigative tasks are at the back of our minds when we’re asked to do a code review, or to test some new code or a pull request.

What people often forget is that finding out if something is broken is completely different from finding out why it's broken. Being quick and efficient at finding broken stuff often takes a different approach from the task of clarifying the causes. A failure to test efficiently is therefore often a failure of imagination: a difficulty in coming up with new techniques for spotting breakage.



Kettlefoot Fire Lookout Tower atop Doe Mountain, Johnson County, Tennessee

Take fire towers, for example. Large forests in places like Canada and the US used to have extensive networks of fire towers. These steel structures literally towered over the surrounding landscape, providing a perfect vantage point for observers to spot and locate fires.

The towers made no attempt to diagnose the cause, monitor each tree, enforce bans on open fires or dictate how people used the forest. They just made it really easy to spot and locate smoke. 
They did not catch every fire, but when they did see smoke, people knew for sure that something was burning. That's something the local loggers and townspeople wanted to know about.

As software developers, we often get lost in the woods and try to check every tree for every kind of problem. Many teams take pride in the fact they can churn out large volumes of regression or integration tests for all their new features. 

For some, it has become normal to write out hundreds of (formerly manually executed) test cases in their chosen programming language. Automated, job-done, next ticket! They’ve taken the worst aspects of old scripted manual tests and shoehorned them into code. In doing so, they made the movie just like the book, and it's painful to watch.

Another approach is to use software ‘fire towers’: relatively simple tests that can see, at a glance, that something is busted. My Cribbage example in my last post was one of these. There, it's quicker and simpler to scan over a large number of inputs checking for the wrong result than it is to exhaustively check each possible correct combination.
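Here is a minimal sketch of the fire-tower idea, assuming the code under test is a cribbage hand scorer. It relies on one fact about the game: no hand can score 19, 25, 26 or 27, or more than 29. The names `smoke_spotted`, `scan_for_smoke` and `buggy_score_hand` are hypothetical, and this is not necessarily the check used in the earlier post.

```python
# Scores that no legal cribbage hand can produce; the maximum hand score is 29.
IMPOSSIBLE_SCORES = {19, 25, 26, 27}
MAX_SCORE = 29


def smoke_spotted(score):
    """True means the score is definitely wrong; False only means no smoke was seen."""
    return score < 0 or score > MAX_SCORE or score in IMPOSSIBLE_SCORES


def scan_for_smoke(score_hand, hands):
    """Score every hand and return the (hand, score) pairs that cannot be right."""
    suspects = []
    for hand in hands:
        score = score_hand(hand)
        if smoke_spotted(score):
            suspects.append((hand, score))
    return suspects


# Stand-in scorer with a deliberate bug, for demonstration only (hypothetical).
def buggy_score_hand(hand):
    return 19 if "5H" in hand else 8


hands = [("5H", "5D", "JS", "KC", "QH"), ("2C", "3D", "4H", "9S", "KD")]
print(scan_for_smoke(buggy_score_hand, hands))
```

The check never works out what the right score is. It only flags results that cannot possibly be right, which is why it stays cheap enough to run over thousands of hands.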

Another example is the GRIM test. Developed by academics Heathers & Brown, it's a quick and easy way to spot if some summary statistics are incorrect. The following video explains how it works:




In summary, it's a useful way to check whether a reported mean and sample size are consistent with each other. The mean of a set of integers must be a whole number divided by the sample size, so only certain decimal values are possible. The test will highlight a mean that is definitely wrong in some way, but it can’t tell you that one is correct. What's clever is that you don’t have to see the raw data. If you have the mean of 100 integers, you don’t need to know those 100 numbers to know if there might be a problem. Like the fire tower, it won’t catch every fire, but if it spots a problem then there's definitely a bug somewhere in there.
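The idea can be sketched in a few lines of Python. This is an illustration only, not the code from any published package. The function name `grim_consistent` is hypothetical, the mean is passed as a string so its reported decimal places are preserved, and half-up rounding is an assumed default.

```python
from decimal import Decimal, ROUND_HALF_UP


def grim_consistent(reported_mean: str, n: int, rounding=ROUND_HALF_UP) -> bool:
    """Return True if some integer total divided by n rounds to reported_mean."""
    if n <= 0:
        raise ValueError("sample size must be positive")
    mean = Decimal(reported_mean)
    # Smallest step at the reported precision, e.g. 0.01 for "5.19".
    quantum = Decimal(1).scaleb(mean.as_tuple().exponent)
    # The true total must lie next to mean * n, so try the nearby integers.
    approx_total = int(mean * n)
    for total in range(approx_total - 1, approx_total + 2):
        if (Decimal(total) / n).quantize(quantum, rounding=rounding) == mean:
            return True
    return False


print(grim_consistent("5.18", 28))  # True: 145 / 28 rounds to 5.18
print(grim_consistent("5.19", 28))  # False: no whole-number total gives 5.19
```

A `False` result means the reported mean cannot have come from `n` integers under the assumed rounding. A `True` result only means no smoke was spotted. Once the sample size is large relative to the reported precision (for example 100 or more values with a mean given to two decimal places), every value passes, so the check is most useful for smaller samples or more precisely reported means.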

This could make your tests much quicker or allow them to work on existing data produced by other tests. For example, extracting data via a GUI can be slow. So we often have to simplify the tests to make them quick.

Instead of reading every raw data value to recompute the mean, you could check the reported average against the sample size using the GRIM test. While you would still need to check that the results are generally correct, being able to check quickly whether big data sets are broken could save you and your tests considerable time.

I’ve created a Python package you can use to incorporate the GRIM test into your Python test code. It's capable of handling the full range of decimal rounding methods available in Python 3, and can even return a summary of which types of rounding deliver a consistent result.
