Investigating Software

Thursday, 10 November 2016

Being a square keeps you from going around in circles.

After a weary few hours sorting through, re-running and manually double-checking the "automated test" results, the team decide they need to "run the tests again!". That's a problem for the team. Why? Because the tests are too slow. The 'test' runs take too long, and they won't have the results until tomorrow.

How does our team intend to fix the problem? ...Make the tests run faster. Maybe use a new framework, get better hardware or some other cool trick.
The team get busy, update the test tools and soon find themselves in a similar position. Now, of course, they need to rewrite the tests in language X or adopt a new [A-Z]+DD methodology. I can't believe you are still using technology Z, Luddites!

Updating your tooling and using a methodology appropriate to your context makes sense, and should be factored into your workflow and estimates. But the approach above starts with the wrong problem. As such, it's not likely to find the right answers.

The team are spending hours unpicking the test results. The results can't be trusted and need to be rerun or manually reviewed. Those are the problems. Until you address the reliability, accuracy and precision of the automated checks, they will always be a major source of failure demand.

That dream of freeing up the team to move quicker, or letting the testers do more exploratory or security-focused testing, will remain a dream while the team spend excessive time picking through the bones of your test results.

Your "automated tests" are a measuring tool. They help you measure the quality of your app. Imagine if your ruler reported a different length every 3rd time you used it! You'd blame the ruler and build or buy a better ruler. Rather than bemoan the time is takes to get an accurate measurement - while re-measuring objects to get "best of three!".

Try fixing, or just disabling, the flaky tests. Test your automated tests. Don't just "create a failing test then see it pass" - investigate whether it was failing for the right reasons and then passing for the right reasons. Speak to your teammates, e.g.: "How can I create Problem X realistically, to check that my tests pick it up reliably?"
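
As a rough sketch of one way to "test your tests" (an assumption-laden example: the pytest command, the test file name and the run count are all invented for illustration), you can re-run the same suite against unchanged code and flag any variation in the verdict. A trustworthy measuring tool should not flip between pass and fail on its own:

```python
import subprocess
from collections import Counter

# Assumptions for illustration only: swap in your real test command.
TEST_COMMAND = ["pytest", "tests/test_checkout.py", "-q"]
RUNS = 10


def run_suite_repeatedly() -> Counter:
    """Run the same suite several times against unchanged code and count
    how often it passes or fails. Any mix of outcomes means the checks,
    not the app, need investigating first."""
    outcomes = Counter()
    for _ in range(RUNS):
        result = subprocess.run(TEST_COMMAND, capture_output=True)
        outcomes["pass" if result.returncode == 0 else "fail"] += 1
    return outcomes


if __name__ == "__main__":
    outcomes = run_suite_repeatedly()
    if len(outcomes) > 1:
        print(f"Unreliable checks: {dict(outcomes)} over {RUNS} identical runs")
    else:
        print(f"Consistent result: {dict(outcomes)} over {RUNS} identical runs")
```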

Do you hear these sort of conversations in your team? If so, then your team might need some coaching.

Sunday, 16 October 2016

A Good Run!


“We got a good run from the tests” the tester stated.
“So what’s the story?” the scrum master asked.
“85% Pass” comes the reply, meekly.
“OK, just need to fix that 5% then.” The scrum master announces before striding off to announce that the team is only a couple of % away from success.

Our tester takes a moment to try and process the exchange…


Firstly, their own words:
“We got a good run”
Why had they said that? Well - in a sense - it was true. They had executed the tests before, and the tests had returned a much higher failure rate. But the code being checked was the same...

OK, so there were at least 3 obvious ways to interpret the data.
  1. The app code meets the criteria checked by the tests. ( Based on test run 2 )
  2. The app code does not meet the criteria checked by the tests. ( Based on test run 1 )
  3. The tests are as reliable as the toss of a coin. ( Based on both test runs )

It's surprising how unlikely people are to choose (3).


Secondly, the scrum master’s words:
“just need to fix that 5%”
Our tester assumes this relates to the de-facto "threshold" that is usually considered good enough to release, as if the results were on a linear scale, such as height or weight. If your code gets over 90% then it gets to pass the gate and get on the release roller-coaster.

The threshold tends to be arbitrary: I worked with a client who thought 86% was good but 83% was just not fit for purpose! The use of such thresholds tends to indicate a problem. Why do we care about a number rather than a possibly broken feature? What features or risks do the failing 10% represent? Why do we have so many routine failures?

Do you hear these sort of conversations in your team? If so, then your team might need some coaching.

Monday, 10 October 2016

Programmers & Testers, two roles divided by a common skill-set.

Switching people from programming to testing, and vice versa, may reduce the quality of our software.


I’ll get some quick objections out of the way:
  1. But, a person can be a great tester and a great programmer. - Yes, I agree.
  2. But, programmers do a lot of good testing. - Yes, I agree.
Neither of these conflicts with my conjecture.

Programming or writing software automates things that would be expensive or hard to do otherwise. Our software might also be faster or less error prone at doing whatever it replaces. It may achieve or enable something that couldn't be done before the software was employed.

Writing software involves overcoming and working around constraints to achieve improvement. That is, looking for ways to bypass limitations imposed by external factors, or limitations in the tools we use. For example, coping with a high-latency internet connection, legacy code or poor-quality inputs. A programmer might say they were taking advantage of the technology's features to create a faster or more stable system. A skilled and experienced programmer has learnt to deal with complexity and produce reliable and effective software. That's why we hire them.

A good example might be WhatsApp. Similar messaging systems existed before WhatsApp. But WhatsApp brought together the platform (mobile iOS & Android), cost (free at point of use), security (e2e encryption) and ease of use that people wanted. These features were tied together and the complexities and niggles were smoothed over. For example, skilled programmers automated address book integration and secure messaging instead of a user having to try and use multiple apps to achieve the same.

But those complexities and constraints are often leads to bugs - leads that are easy to not fully appreciate. The builder's approach ("it does that, so I need to do this") can override the more investigative approach of puzzling over what a system's apparent behaviour means about its underlying algorithm or supporting libraries. A good tester hypothesizes about what behaviours might be possible from the software. For example: could we get the app to do an unwanted thing, or the right thing at the wrong time?

Alternatively, a tester may observe the software in action, bearing in mind that certain symptoms may be caused by the constraints within the software or its construction. That psychological 'priming' can make them more likely to spot such issues during their examination of the app.

A common response at this point in the debate is, “The person writing the app/automated tests/etc would be able to read the code and see directly what the algorithm is!” But, that argument is flawed for 2 main reasons:

  1. Many testers can read and write good code. This sort of investigation is, and always has been, an option - whether we are currently implementing the app, an 'automated test', or neither. The argument is often a straw man suggesting that all testers cannot write (and therefore can't read) code.
  2. In a system of any reasonable complexity, there are situations where it's easier to ascertain the system's actual behaviour empirically. An attempt to judge its behaviour purely by examining the code is likely to miss a myriad of bugs caused by 3rd-party libraries, environmental issues and a mish-mash of different programmers' work.

For example...

Programmers:


Programmers can and do test their own and colleagues' code. They do so as programmers, focused on those very difficult constraints. One of the key constraints is time. Not just "can the code handle time zones?", but "how long do I have to implement this?", "what can I achieve in that time?", "can I refactor that dodgy library code, or do I just treat it as good?". Time and the other constraints guide the programmer down a different testing road. Their rationed time leads them to focus on what can be delivered. Their focus is on whether their code met the criteria in the ticket, given the complexities that they had to overcome.

A classic symptom of this is the congruence bias: programmers implement code and tests that ascertain whether the system is functioning as they expect. That's good. It can tell us the app can achieve what was intended. A good example might be random number generation. A team might be assigned to produce an API that provides a randomiser for other parts of the app. The team had been told the output needed to be securely random - that is, very random.

The team, knowing about such things, use their operating system's built-in features to generate the numbers (for example, on Linux that might be /dev/random). Being belt-and-braces kind of people, they implement some unit tests that perform a statistical analysis of their function's output. These would likely pass with every build once the usual minor bugs were fixed, and all would be good.
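
As a rough illustration of that confirmatory step (a sketch only: os.urandom stands in for whatever randomiser API the team actually built, and a single bit-frequency check stands in for a fuller statistical suite), such a unit test might look like this:

```python
import os
import unittest


class RandomnessSmokeTest(unittest.TestCase):
    """A very rough statistical sanity check, of the kind the team above
    might write. It only confirms the expected behaviour; it says nothing
    about how the generator behaves under load."""

    SAMPLE_BYTES = 100_000

    def test_bit_frequency_is_roughly_balanced(self):
        # Draw a sample from the OS generator (os.urandom stands in for
        # the team's actual randomiser API).
        data = os.urandom(self.SAMPLE_BYTES)

        # Count the set bits across the whole sample.
        ones = sum(bin(byte).count("1") for byte in data)
        total_bits = self.SAMPLE_BYTES * 8

        # For a fair bit stream the proportion of ones should sit close to
        # 0.5; the tolerance is generous so this check itself isn't flaky.
        proportion = ones / total_bits
        self.assertAlmostEqual(proportion, 0.5, delta=0.01)


if __name__ == "__main__":
    unittest.main()
```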

Testers:


Luckily for the above team, they had a tester. That tester loved a challenge. Of course she checked the randomness of the system, and yes, that looked OK. She also checked the code in conjunction with other systems, and again the system worked OK. She also checked whether the code was fast enough, and once again the system was fine. The tester then set up a test system with a high load. Boom. The log was full of timeout errors, performance was now atrocious, and she knew she had struck gold. A little investigation would show that some operating system random number generators are 'blocking'.

A blocking algorithm will cause subsequent requests to be queued (‘blocked’) until its predecessor has finished. Even if the algorithm is fast, there will be a tipping point when suddenly more requests are coming in than can be serviced. At that point the number of successful requests (per second, for example) will cease to keep up with demand. Typically we might expect a graph of the requests being effectively handled by our system to show a plateau at this point.
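
To make that tipping point concrete, here is a small, self-contained simulation (my sketch, not the team's system): a shared lock stands in for the blocking generator, each 'request' does a little independent work and then holds the lock for a fixed service time, and the measured throughput climbs with load before flattening out near 1 / SERVICE_TIME, however many workers we add.

```python
import threading
import time

OTHER_WORK = 0.020    # per-request work that does not block anyone else
SERVICE_TIME = 0.005  # time each request holds the blocking resource
RUN_TIME = 1.0        # how long to drive load at each concurrency level


def measure_throughput(workers: int) -> float:
    """Return completed requests per second for a given number of workers."""
    lock = threading.Lock()          # stands in for the blocking generator
    counter_lock = threading.Lock()
    completed = 0
    start = time.monotonic()
    deadline = start + RUN_TIME

    def worker() -> None:
        nonlocal completed
        while time.monotonic() < deadline:
            time.sleep(OTHER_WORK)   # non-blocking part of the request
            with lock:               # requests queue here, one at a time
                time.sleep(SERVICE_TIME)
            with counter_lock:
                completed += 1

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return completed / (time.monotonic() - start)


if __name__ == "__main__":
    # Expect throughput to grow with the first few workers, then plateau
    # around 1 / SERVICE_TIME (about 200 requests per second here).
    for workers in (1, 2, 5, 10, 20, 50):
        print(f"{workers:3d} workers -> {measure_throughput(workers):6.0f} req/s")
```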

Our tester had double-checked that the code could do the job, but she also questioned the code in ways the team had not thought to. Given that there are tools and techniques to aid the measurement of randomness, this confirmatory step would likely be relatively short. A greater period of time would likely have been spent investigating what were, at the time, unknowns. Therefore, the question is less whether the tester can read the code or validate that it performs as we predicted. The question is more: can they see what the software might have been? How might it fall short? What could or should we have built?

Our tester had a different mindset: she stepped beyond what was specified. We can all do this, but we get better the more we do it. We get better if we train at it. Being a good systems programmer, and training at that, comes at the cost of training in the tester mindset. Furthermore, the two mindsets are poles, each at an opposite end of our cognitive abilities. Striving at one skillset might not help with the other. A tester who writes code has great advantages. They can and do create test tools to tease out the actual behaviour of the system. These programming skills have an investigative focus. They may even have an exploratory or exploitative (think security testing) focus, but not a construction focus.

For those that are screaming BUT THEY COULD HAVE HAD A BDD SCENARIO FOR HIGH LOAD or similar, I'll remind you of hindsight bias (tl;dr: "the inclination, after an event has occurred, to see the event as having been predictable").

While the programmer and the tester often share the same headline skills - e.g. they can program in language X, and understand and utilise patterns Y & Z appropriately - they apply these skills differently, to different ends.

The change from tester to programmer is more than a context switch. It's a change in your whole approach. People can do this, but it has a cost. That cost might be paid in slower delivery, bugs missed, or features not implemented.

Tuesday, 2 August 2016

Synecdoche

A common but often unnoticed figure of speech is the synecdoche. When I say "Beijing opened its borders", we know I mean "The People's Republic of China has opened its borders." That's a synecdoche: in this case I named part of something (Beijing) to mean the whole (the P.R.C.).

Conversely, I might say “Westminster is in turmoil” when anyone with knowledge of British politics will know I mean, “The politicians in the Houses of Parliament are in turmoil”. The reader will know I am not referring to The City of Westminster, a region of London. (Or the place in Canada etc.)

Synecdoche can be a useful and illustrative tool in conversation, helping to convey the size or importance of the subject, or to illustrate in more detail a subtlety of the situation. For example, "Beijing opened its borders" also indicates the power of that country's central government: some residents of one city in China can open [or close] the borders of a vast country spanning thousands of miles and comprising over 1.3 billion people.

Synecdoches can also lead to ambiguity, and are particularly dependent on context. For example, the same phrase "Westminster is in turmoil", accompanied by a picture of a derailed train, smoke and ambulances, would lead the reader to assume the geographic region of Westminster was being referred to.

Just this sort of language, and potential for confusion, exists within software development. For example, a Product Owner might ask a team to code a feature for her App. A technical lead would likely know her team will actually analyse, converse, script, code, test, fix, report, document, review, etc. - and probably do this across multiple systems - before she can agree with the Product Owner that the App's feature is complete or 'coded'.

Why don’t technical leads get annoyed by this narrow description of the work? Well, actually they do, all the time. When working as a Scrum Master and Program Manager I frequently had to smooth these sorts of negotiations. Often a technical lead or test lead would take the product owners choice of the word (e.g.: “code” or “develop”) to mean that the work required was not significant. When the Product Owner’s words could have been translated as “do clever stuff to make it happen”.

Product Owners were often not from a programming or testing background. Occasionally they would not use the same jargon as developers or, more often, they used the same terms but with their own meanings. For example, using ‘code’ to mean the whole software development and release process.

While some friction would be caused in circumstances where someone might use the wrong or, to the team, misleading jargon, the team usually adapted. The team might use the jargon between themselves, but then adopt a less ‘technical’ (their words) language style when talking to others. That is, people outside the core team.

Testing also has situations where we frequently say one thing, and rely on context to mean so much more or less. 'Test automation', for example. This simple term can cover a range of tools, techniques and even approaches.

In my experience, ‘Test automation’ has for example referred to test data generators or shell scripts. These would check data-outputs were within a valid range, given data-inputs of historical purchases. I have also worked with successful teams where the term test automation meant random input generators combined with a simple run-until-crash check.

Furthermore, I have worked on systems where 'Test Automation' results could be red / green / pass / fail style messages reported from a GUI- or API-based test tool. In another team, our results could only have been usefully discerned with the aid of graphing software. On some projects the skilled expertise of a statistician was required to decide whether our test code had uncovered an issue. On occasion, the term 'Test Automation' could mean several or all of the above.

When talking with my team, I need to be more specific. I, like them, have to be able to describe what I'm doing and why. I could just say "I'm doing test automation", but that would be like a developer stating "I'm doing feature X". Having a precise way to describe my work, and how it relates to the work of my team members, is valuable and time-saving - not just in the time saved re-explaining and clarifying concepts, but more importantly in not having to redo things we thought were complete or correct the first time.

Having the words to describe our work in detail is invaluable. The sorts of things we talk about within a team are jargon-heavy. E.g., I need to explain to my team that I'm coding a check for the product's UTF-16 surrogate-pair handling, to be added to the Continuous Integration process, and that this might mean we don't complete a feature this sprint. I may need to clarify that I'm writing a script to be used as an oracle to aid our User Interface testing, or ask the programmers to include a testability hook to aid our log-file analysis.
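
For a flavour of what that surrogate-pair check might look like (a sketch under assumptions: product_transform is a made-up stand-in, and a real check would call the product's own API):

```python
import unittest


def product_transform(text: str) -> str:
    # Stand-in for the product code under test; here it simply round-trips
    # the text through UTF-16, which is where surrogate pairs appear.
    return text.encode("utf-16-le").decode("utf-16-le")


class SurrogatePairHandling(unittest.TestCase):
    """Characters outside the Basic Multilingual Plane (e.g. emoji) are
    stored as surrogate pairs in UTF-16; a product that truncates or splits
    them corrupts the text."""

    SAMPLES = [
        "\U0001F600",                                        # grinning face emoji
        "\U0001D11E",                                        # musical symbol G clef
        "plain ASCII \U0001F680 mixed with astral characters",
    ]

    def test_astral_characters_survive_round_trip(self):
        for sample in self.SAMPLES:
            with self.subTest(sample=sample):
                self.assertEqual(product_transform(sample), sample)


if __name__ == "__main__":
    unittest.main()
```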

The language used to communicate these ideas is important. The language and terms themselves are worthy of at least some discussion. If we as a team are unfamiliar with the terms, or with their differing contextual meanings, we will likely end up very confidently and quietly not knowing what we are doing all day.