Investigating Software

Tuesday, 11 April 2017

Why you might need testers


I remember teaching my son to ride his bike. No, strike that: helping him to learn to ride his bike. It’s that way round – if we are honest – he was changing his brain so it could adapt to the mechanism and behaviour of the bike. I was just holding the bike, pushing, and showering him with praise and tips.

If he fell, I didn’t and couldn’t change the way he was riding the bike. I suggested things, rubbed his sore knee and pointed out that he had just cycled further in that last attempt than he had ever managed before – Son, this is working, you’re getting it.

I had help of course, gravity being one. When he lost balance, it hurt. Not a lot, but enough for his brain to get the feedback it needed to rewire a few neurons. If the mistakes were subtler, advice might help – try going faster, that will make the bike less wobbly. The excitement of going faster and better helped rewire a few more neurons.

When we have this sort of immediate feedback we learn quicker; we improve our game. When the feedback is about problems we don’t even know might exist, that expands our knowledge, letting us adapt before it’s painful to do so.

My son had his own plan, skill and will. He could have learned on his own. It probably would have taken longer and maybe been more painful or even dangerous. Working as a team, we achieved more than we could have alone.

In software development that’s how successful and sustainable teams work. We don’t have gravity, so we need other sources of feedback. Some of those might be ‘automated tests’, but much like my son’s own ‘plan’ they only cover predicted behaviour. “So I just pedal and avoid the trees - Dad, it won’t be hard”. We need another mind and experience to help us see outside our view of the problems. “Stay off the grass as well, it’s harder and more unstable there”.

That’s where skilled testers come in. Some of us are highly technical and will highlight technical issues or opportunities you haven’t noticed. Some of us are skilled at seeing the app from the customer’s viewpoint and highlighting problems that will cost you customers. Many of us do all of the above and much more. That could be letting you know when we are breaking the law. We often write tools that enable us to test more things, better. Testers can also be useful for pointing out the simple stuff, like: we are developing for Red Hat Linux, but 60% of our users are on Windows and 25% are on Apple Macs.

“Our analytics will tell us what the customers are doing! We can tell if a feature or bug matters from the metrics!” Great idea. Do the analytics work well enough? Do you know if a user experiences an error? Do your analytics double-report or under-report? A second pair of eyes will probably find that the analytics don’t work as well as we’d hoped.


Monday, 20 February 2017

Thank you for finding the bug I missed.

Thank you to the colleague/customer/product owner who found the bug I missed. That oversight was (at least in part) my mistake. I've been thinking about what happened and what that means to me and my team.

I'm happy you told me about the issue you found, because you...

1) Opened my eyes to a situation I'd never have thought to investigate.

2) Gave me another item for my checklist of things to check in future.

3) Made me remember that we are never done testing.

4) Reminded me that we are never sure if the application 'works' well enough.

5) Reminded me to explore more and build less.

6) Prompted me to request that we assign more time to finding these issues.

7) Let me experience the hindsight bias, so that the edge-case now seems obvious!

Monday, 16 January 2017

Google Maps Queue Jumps.


Google Maps directs me to and from my client sites. I've saved the locations of my clients' car parks, so when I start the app in the morning it knows where I want to go. When I start it at the end of the day, Google knows where I want to go.

This is great! It guides me around traffic jams, adjusts when I miss a turn and even offers faster routes en-route as they become available.

But sometimes Google Maps does something wrong. I don't mean incorrect, like how it sometimes gets a street name wrong (typically in a rural area). I don't mean how its GPS fix might put me in a neighbouring street (10m to my left - when there are trees overhead).

I mean wrong as in unfair and socially unacceptable. An action that, if a person did it, would be frowned upon.

Example:

Let’s assume a road has a traffic jam, so instead of the cars doing around 60 mph, we are crawling at <10 mph.

In the middle of this traffic jam, the road has a junction: an exit slip-road (A) and, a little further along, an entry slip-road back onto the same road (B).

Google Maps, using its algorithm/AI, directs me off at exit (A), but rather than finding an alternative route it directs me back down onto the road at point (B).

Google Maps has queue-jumped. Google's decision was reasonable and met its goals. Goals that, I assume, include reducing my journey time. It has bypassed approximately a quarter of a mile of queuing cars.

It’s an intriguing issue, for a number of reasons:
  •  In cultures where queues are not expected, it might be OK (it’s a clever optimisation!)
  •  Conversely, some cultures may consider it a bug.
  •  Could we even alter/educate the algorithm to reliably distinguish between some routes (short-cuts) and others (queue-jumps)?
  •  What else might the AI deem OK, that I would consider wrong?
  •  What are Google Maps’ goals? Are they all in my interest and safety?
  •  These seem very far from pass/fail scenarios.

For example, what if Google noticed that users used Google, YouTube or its adverts more after taking a certain route than if they had taken another?

What if that advert-hungry route was slower? Or more dangerous? (E.g. people use Google more after that route, as they need to get their car repaired.)


Are these issues being tested for? Are the right questions being asked?

Thursday, 10 November 2016

Being a square keeps you from going around in circles.

After a weary few hours sorting through, re-running and manually double-checking the "automated test" results, the team decide they need to "run the tests again!" That's a problem for the team. Why? Because the tests are too slow. The 'test' runs take too long and they won't have the results until tomorrow.

How does our team intend to fix the problem? ... Make the tests run faster. Maybe use a new framework, get better hardware or some other cool trick.
The team get busy, update the test tools and soon find themselves in a similar position. Now of course they need to rewrite the tests in language X or using a new [A-Z]+DD methodology. I can't believe you are still using technology Z, Luddites!

Updating your tooling and using a methodology appropriate to your context makes sense, and should be factored into your workflow and estimates. But the above approach starts with the wrong problem, so it's not likely to find the right answers.

The team are spending hours unpicking the test results. The results can't be trusted and need to be rerun or manually reviewed. Those are the problems. Until you address the reliability, accuracy and precision of the automated checks, they will always be a major source of failure demand.

That dream of freeing up the team to move quicker, or of letting the testers do more exploratory or security-focused testing, will remain a dream while the team spend excessive time picking through the bones of your test results.

Your "automated tests" are a measuring tool. They help you measure the quality of your app. Imagine if your ruler reported a different length every 3rd time you used it! You'd blame the ruler and build or buy a better ruler. Rather than bemoan the time is takes to get an accurate measurement - while re-measuring objects to get "best of three!".

Try fixing or just disabling the flaky tests. Test your automated tests. Don't just "create a failing test then see it pass" - investigate whether it was failing for the right reasons and then passing for the right reasons. Speak to your teammates, e.g.: "How can I create Problem X realistically, to check that my tests pick it up reliably?"
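To make that concrete, here is a minimal sketch of one way to probe a check for flakiness: run it repeatedly against unchanged code and see whether the verdict holds steady. Everything here is hypothetical (check_login_page merely simulates one of your automated checks), but the shape of the exercise is the point.

```python
# A minimal sketch (Python): re-run one check many times against unchanged code.
# If the verdict varies, the check - not the app - is the problem.
import collections
import random

def check_login_page() -> bool:
    """Hypothetical stand-in for one of the team's automated checks.
    Simulated here as flaky: it passes roughly 85% of the time."""
    return random.random() < 0.85

def flakiness_report(check, runs: int = 20) -> collections.Counter:
    """Run the same check repeatedly and tally the verdicts."""
    results = collections.Counter()
    for _ in range(runs):
        try:
            results["pass" if check() else "fail"] += 1
        except Exception as exc:
            results["error: " + type(exc).__name__] += 1
    return results

if __name__ == "__main__":
    report = flakiness_report(check_login_page)
    print(report)
    if len(report) > 1:
        print("Same code, different verdicts: fix or quarantine this check.")
```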

Do you hear these sort of conversations in your team? If so, then your team might need some coaching.

Sunday, 16 October 2016

A Good Run!


“We got a good run from the tests” the tester stated.
“So what’s the story?” the scrum master asked.
“85% Pass” comes the reply, meekly.
“OK, just need to fix that 5% then,” the scrum master replies before striding off to announce that the team is only a couple of per cent away from success.

Our tester takes a moment to try and process the exchange…


Firstly, their own words:
“We got a good run”
Why had they said that? Well, in a sense, it was true. They had executed the tests before, and the tests had returned a much higher failure rate. But the code being checked was the same...

OK, so there were at least 3 obvious ways to interpret the data.
  1. The app code meets the criteria checked by the tests. ( Based on test run 2 )
  2. The app code does not meet the criteria checked by the tests. ( Based on test run 1 )
  3. The tests are as reliable as the toss of a coin. ( Based on both test runs )

It's surprising how unlikely people are to choose (3).


Secondly, the scrum master’s words:
“just need to fix that 5%”
Our tester assumes this relates to the de facto “threshold” that is usually considered good enough to release, as if the results were a linear scale, such as height or weight. If your code gets over 90% then it gets to pass the gate and get on the release roller-coaster.

The threshold tends to be arbitrary. I worked with a client that thought 86% was good, but 83% was just not fit for purpose! Such thresholds tend to indicate a problem. Why are we caring about a number rather than a possibly broken feature? What features or risks do the failing 10% represent? Why do we have so many routine failures?

Do you hear these sort of conversations in your team? If so, then your team might need some coaching.

Monday, 10 October 2016

Programmers & Testers, two roles divided by a common skill-set.

Switching people from programming to testing, and vice versa, may reduce the quality of our software.


I’ll get some quick objections out of the way:
  1. But, a person can be a great tester and programmer. - Yes, I agree.
  2. But, programmers do a lot of good testing. - Yes, I agree.
Neither of the above is in conflict with my conjecture.

Programming or writing software automates things that would be expensive or hard to do otherwise. Our software might also be faster or less error prone at doing whatever it replaces. It may achieve or enable something that couldn't be done before the software was employed.

Writing software involves overcoming and working around constraints to achieve improvement. That is, looking for ways to bypass limitations imposed by external factors or limitations in the tools we use. For example, coping with a high latency internet connection, legacy code or poor quality inputs. A programmer might say they were taking advantage of the technologies’ features to create a faster/more-stable system. A skilled and experienced programmer has learnt to deal with complexity and produce reliable and effective software. That's why we hire them.

A good example might be WhatsApp. Similar messaging systems existed before WhatsApp. But WhatsApp brought together the platform (mobile iOS & Android), cost (free at point of use), security (e2e encryption) and ease of use that people wanted. These features were tied together and the complexities and niggles were smoothed over. For example, skilled programmers automated address book integration and secure messaging instead of a user having to try and use multiple apps to achieve the same.

But those complexities and constraints are often leads to bugs, leads that are easy to not fully appreciate. The builder's approach ("it does that, so I need to do this") can override the more investigative approach of puzzling over what a system's apparent behaviour means about its underlying algorithm or supporting libraries. A good tester hypothesizes about what behaviours might be possible from the software. For example: could we get the app to do an unwanted thing, or the right thing at the wrong time?

Alternatively, a tester may observe the software in action while bearing in mind that certain symptoms may be caused by the constraints within the software or its construction. This psychological ‘priming’ can make them more likely to spot such issues during the course of their examination of the app.

A common response at this point in the debate is, “The person writing the app/automated tests/etc would be able to read the code and see directly what the algorithm is!” But, that argument is flawed for 2 main reasons:

  1. Many testers can read and write good code. This sort of investigation is, and always has been, an option - whether we are currently implementing the app, an ‘automated test’, or neither. The argument is often a straw man suggesting that all testers cannot write (and therefore can’t read) code.
  2. In a system of any reasonable complexity, there are situations where it’s easier to ascertain a system’s actual behaviour empirically. An attempt to judge its behaviour purely by examining the code is likely to miss a myriad of bugs caused by 3rd-party libraries, environmental issues and a mish-mash of different programmers’ work.

For example...

Programmers:


Programmers can and do test their own and colleagues’ code. They do so as programmers, focused on those very difficult constraints. One of the key constraints is time. Not just “can the code handle time zones?” but “how long do I have to implement this? What can I achieve in that time? Can I refactor that dodgy library code, or do I just treat it as good?” Time and the other constraints guide the programmer down a different testing road. Their rationed time leads them to focus on what can be delivered. Their focus is on whether their code met the criteria in the ticket, given the complexities they had to overcome.

A classic symptom of this is the congruence bias: programmers implement code and tests that ascertain whether the system is functioning as they expect. That’s good. That can tell us the app can achieve what was intended. A good example might be random number generation. A team might be assigned to produce an API that provides a randomiser for other parts of the app. The team had been told the output needed to be securely random. That is, very random.

The team, knowing about such things, use their operating system’s built-in features to generate the numbers (for example, on Linux that might be /dev/random). Being belt-and-braces kind of people, they implement some unit tests that perform a statistical analysis of their function. These would likely pass with every build once the usual minor bugs were fixed, and all would be good.
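As a rough illustration of that confirmatory style of test (everything below is an assumption: a hypothetical get_random_bytes() wrapper delegating to the OS, and a very crude monobit frequency check):

```python
# Sketch of a confirmatory unit test for a randomiser API (all names hypothetical).
import math
import os
import unittest

def get_random_bytes(n: int) -> bytes:
    """Hypothetical API under test; here it simply delegates to the OS generator."""
    return os.urandom(n)

class RandomnessSmokeTest(unittest.TestCase):
    def test_monobit_frequency(self):
        """Roughly half of all bits should be set - a very crude randomness check."""
        data = get_random_bytes(125_000)            # 1,000,000 bits
        bits = len(data) * 8
        ones = sum(bin(byte).count("1") for byte in data)
        # For a fair source, ones is ~Normal(bits/2, sqrt(bits)/2); allow ~4 sigma.
        self.assertLess(abs(ones - bits / 2), 4 * math.sqrt(bits) / 2)

if __name__ == "__main__":
    unittest.main()
```

A test like this will happily go green on every build, and it says nothing about what happens when many callers arrive at once.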

Testers:


Luckily for the above team, they had a tester. That tester loved a challenge. Of course she checked the randomness of the system, and yes, that looked OK. She also checked the code in conjunction with other systems, and again the system worked OK. The tester also checked whether the code was fast enough, and once again the system was fine. The tester then set up a test system with a high load. Boom. The log was full of timeout errors, performance was now atrocious and she knew she had struck gold. A little investigation would show that some operating system random number generators are 'blocking'.

A blocking algorithm will cause subsequent requests to be queued (‘blocked’) until its predecessor has finished. Even if the algorithm is fast, there will be a tipping point when suddenly more requests are coming in than can be serviced. At that point the number of successful requests (per second, for example) will cease to keep up with demand. Typically we might expect a graph of the requests being effectively handled by our system to show a plateau at this point.
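A small load probe, of the sort our tester might knock together, shows the effect. The sketch below is purely illustrative: random_token() stands in for the API under test, and a shared lock plus a short sleep stands in for the blocking entropy source.

```python
# Sketch of a throughput probe against a blocking resource (all names hypothetical).
import threading
import time

_entropy_lock = threading.Lock()      # stands in for a blocking entropy source

def random_token() -> int:
    """Hypothetical API under test: each call blocks on a shared resource."""
    with _entropy_lock:
        time.sleep(0.001)             # a ceiling of roughly 1,000 calls per second
        return 42

def completed_requests(concurrent_clients: int, duration: float = 1.0) -> int:
    """Count how many calls finish within 'duration' seconds at a given concurrency."""
    done = 0
    done_lock = threading.Lock()
    deadline = time.monotonic() + duration

    def client():
        nonlocal done
        while time.monotonic() < deadline:
            random_token()
            with done_lock:
                done += 1

    threads = [threading.Thread(target=client) for _ in range(concurrent_clients)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return done

if __name__ == "__main__":
    for clients in (1, 5, 25, 125):
        print(clients, "clients ->", completed_requests(clients), "requests completed")
    # However many clients we add, completions plateau near the blocking ceiling.
```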

Our tester had double-checked that the code could do the job, but also questioned the code in ways the team had not thought to. Given that there are tools and techniques to aid the measurement of randomness, this confirmatory step would likely be relatively short. A greater period of time would likely have been spent investigating areas that, at the time, were unknowns. Therefore, the question is less whether the tester can read the code, or validate that it performs how we predicted. The question is more: can they see what the software might have been? How might it fall short? What could or should we have built?

Our tester had a different mindset; she stepped beyond what was specified. We can all do this, but we get better the more we do it. We get better if we train at it. Being a good systems programmer, and training at that, comes at the cost of training in the tester mindset. Furthermore, the two mindsets are poles, each at an opposite end of our cognitive abilities. Striving at one skillset might not help with the other. A tester that writes code has great advantages. They can and do create test tools to tease out the actual behaviour of the system. These programming skills have an investigative focus. They may even have an exploratory or exploitative (think security testing) focus, but not a construction focus.

For those that are screaming BUT THEY COULD HAVE HAD A BDD SCENARIO FOR HIGH LOAD or similar, I’ll remind you of the hindsight bias (tl;dr: “the inclination, after an event has occurred, to see the event as having been predictable”).

While the programmer and the tester often share the same headline skills (e.g. they can program in language X, and understand and utilise patterns Y & Z appropriately), they apply these skills differently, to different ends.

The change from tester to programmer is more than a context switch. It's a change in your whole approach. People can do this, but it has a cost. That cost might be paid in slower delivery, bugs missed, or features not implemented.

Tuesday, 2 August 2016

Synecdoche

A common but often unnoticed figure of speech is the synecdoche. When I say “Beijing opened its borders”, we know I mean “The People's Republic of China has opened its borders”. That’s a synecdoche: in this case I named part of something (Beijing) to mean the whole (the P.R.C.).

Conversely, I might say “Westminster is in turmoil” when anyone with knowledge of British politics will know I mean, “The politicians in the Houses of Parliament are in turmoil”. The reader will know I am not referring to The City of Westminster, a region of London. (Or the place in Canada etc.)

Synecdoche can be a useful and illustrative tool of conversation, helping to convey the size or importance of the subject, or to illustrate in more detail a subtlety of the situation. For example, “Beijing opened its borders” also indicates the power of that country's central government: some residents of one city in China can open [or close] the borders of a vast country spanning thousands of miles and comprising over 1.3 billion people.

Synecdoches can also lead to ambiguity, and are particularly dependent on context. For example, the same phrase “Westminster is in turmoil”, accompanied by a picture of a derailed train, smoke and ambulances, would lead the reader to assume the geographic region of Westminster was being referred to.

Just this sort of language and potential for confusion exists within software development. For example, a Product Owner might ask a team to code a feature for her App. A technical lead would likely know her team will actually analyse, converse, script, code, test, fix, report, document, review etc., and probably do this across multiple systems, before she can agree with the Product Owner that the App’s feature is complete or ‘coded’.

Why don’t technical leads get annoyed by this narrow description of the work? Well, actually they do, all the time. When working as a Scrum Master and Program Manager I frequently had to smooth these sorts of negotiations. Often a technical lead or test lead would take the product owner’s choice of word (e.g. “code” or “develop”) to mean that the work required was not significant, when the Product Owner’s words could have been translated as “do clever stuff to make it happen”.

Product Owners were often not from a programming or testing background. Occasionally they would not use the same jargon as developers or, more often, they used the same terms but with their own meanings. For example, using ‘code’ to mean the whole software development and release process.

While some friction would be caused in circumstances where someone might use the wrong or, to the team, misleading jargon, the team usually adapted. The team might use the jargon between themselves, but then adopt a less ‘technical’ (their words) language style when talking to others. That is, people outside the core team.

Testing also has situations where we frequently say one thing, and rely on context to mean so much more or less. ‘Test automation’, for example. This simple term can cover a range of tools, techniques and even approaches.

In my experience, ‘Test automation’ has for example referred to test data generators or shell scripts. These would check data-outputs were within a valid range, given data-inputs of historical purchases. I have also worked with successful teams where the term test automation meant random input generators combined with a simple run-until-crash check.
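For illustration, that 'random inputs plus run-until-crash' style of automation can be as small as the sketch below. parse_purchase_record() is a hypothetical stand-in for the code under test; the fixed seed keeps any crash reproducible.

```python
# Sketch of a run-until-crash harness fed with random inputs (names hypothetical).
import random
import string
import traceback

def parse_purchase_record(raw: str) -> dict:
    """Hypothetical function under test."""
    name, _, amount = raw.partition(",")
    return {"name": name, "amount": float(amount)}

def random_record(rng: random.Random) -> str:
    """Generate a short string of arbitrary printable characters."""
    return "".join(rng.choice(string.printable) for _ in range(rng.randint(0, 40)))

def run_until_crash(iterations: int = 100_000, seed: int = 1234) -> None:
    rng = random.Random(seed)                 # seeded so any crash is reproducible
    for i in range(iterations):
        raw = random_record(rng)
        try:
            parse_purchase_record(raw)
        except ValueError:
            pass                              # an expected rejection of bad input
        except Exception:
            print(f"Crash on iteration {i} with input {raw!r}")
            traceback.print_exc()
            return
    print("No unexpected crashes in", iterations, "iterations")

if __name__ == "__main__":
    run_until_crash()
```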

Furthermore, I have worked on systems where ‘Test Automation’ results could be red/green/pass/fail style messages reported from a GUI- or API-based test tool. In another team our results could only have been usefully discerned with the aid of graphing software. On some projects the skilled expertise of a statistician was required to decide whether our test code had uncovered an issue. On occasion, the term 'Test Automation' could mean several or all of the above.

When talking with my team, I need to be more specific. I, like them, have to be able to describe what I’m doing and why. I could just say “I’m doing test automation”, but that would be like a developer stating “I’m doing feature X”. Having a precise way to describe my work, and how it relates to the work of my team members, is valuable and time-saving. Not just in the time saved by not re-explaining and clarifying concepts, but more importantly in not having to re-do things we thought were complete or correct the first time.

Having the words to describe our work in detail is invaluable. The sorts of things we talk about within a team are jargon-heavy. E.g. I need to explain to my team that I’m coding a check for the product’s UTF-16 surrogate-pair handling, to be added to the Continuous Integration process, and that this might mean we don’t complete a feature this sprint. I may need to clarify that I’m writing a script to be used as an oracle to aid our User Interface testing, or ask the programmers to include a testability hook to aid our log-file analysis.
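For instance, a surrogate-pair check of that kind might be no bigger than the sketch below, where render_username() is a hypothetical stand-in for the product code the real check would import.

```python
# Sketch of a CI check for UTF-16 surrogate-pair handling (names hypothetical).
import unittest

def render_username(name: str) -> str:
    """Hypothetical product function; must preserve characters outside the BMP."""
    return name.strip()

class SurrogatePairHandling(unittest.TestCase):
    def test_astral_characters_survive(self):
        name = "Zo\u00eb \U0001F600"        # the emoji needs a surrogate pair in UTF-16
        rendered = render_username(name)
        self.assertIn("\U0001F600", rendered)
        # A round trip through UTF-16 must not split or drop the pair.
        self.assertEqual(rendered.encode("utf-16-le").decode("utf-16-le"), rendered)

if __name__ == "__main__":
    unittest.main()
```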

The language used to communicate these ideas is important. The language and terms themselves are worthy of at least some discussion. If we as a team are unfamiliar with the terms, or with their differing contextual meanings, we will likely end up very confidently and quietly not knowing what we are doing all day.