Investigating Software

Thursday, 10 November 2016

Being a square keeps you from going around in circles.

After a weary few hours spent sorting through, re-running and manually double-checking the "automated test" results, the team decide they need to "run the tests again!". That's a problem for the team. Why? Because the tests are too slow. The 'test' runs take too long, and they won't have the results until tomorrow.

How does our team intend to fix the problem? ... make the tests run faster. Maybe use a new framework, get better hardware or some other cool trick.
The team get busy, update the test tools and soon find themselves in a similar position. Now, of course, they need to rewrite the tests in language X or adopt a new [A-Z]+DD methodology. I can't believe you are still using technology Z, Luddites!

Updating your tooling and using a methodology appropriate to your context makes sense, and should be factored into your workflow and estimates. But the approach above starts with the wrong problem. As such, it's not likely to find the right answers.

The team are spending hours unpicking the test results. The results can't be trusted and need to be rerun or manually reviewed. Those are the problems. Until you address the reliability, accuracy and precision of the automated checks, they will always be a major source of failure demand.

That dream of freeing up the team to move quicker, or of letting the testers do more exploratory or security-focused testing, will remain a dream while the team spend excessive time picking through the bones of the test results.

Your "automated tests" are a measuring tool. They help you measure the quality of your app. Imagine if your ruler reported a different length every 3rd time you used it! You'd blame the ruler and build or buy a better ruler. Rather than bemoan the time is takes to get an accurate measurement - while re-measuring objects to get "best of three!".

Try fixing, or just disabling, the flaky tests. Test your automated tests. Don't just "create a failing test then see it pass" - investigate whether it was failing for the right reasons and then passing for the right reasons. Speak to your teammates, e.g.: "How can I create Problem X realistically, to check that my tests pick it up reliably?"
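
For illustration, here is a minimal, hypothetical sketch (Python, pytest-style; the names total_price, broken_total_price and check_total_price are invented for the example) of what "testing the test" can look like: seed a known defect and confirm the check actually reports it before trusting its green runs.

```python
# A sketch of "testing the test": seed a known defect and confirm the
# check reports it, for the reason we expect. All names are hypothetical.

def total_price(items):
    """Toy system under test."""
    return sum(item["price"] for item in items)

def broken_total_price(items):
    """The same function with a deliberately seeded defect (off-by-one)."""
    return total_price(items) - 1

def check_total_price(calculate):
    """The 'automated test' whose reliability we want to establish."""
    items = [{"price": 10}, {"price": 5}]
    return calculate(items) == 15

def test_check_passes_on_good_code():
    assert check_total_price(total_price)

def test_check_catches_seeded_defect():
    # If this fails, the check gives false confidence: it cannot detect
    # the very class of problem it exists to catch.
    assert not check_total_price(broken_total_price)
```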

Do you hear these sort of conversations in your team? If so, then your team might need some coaching.

Sunday, 16 October 2016

A Good Run!


“We got a good run from the tests” the tester stated.
“So what’s the story?” the scrum master asked.
“85% Pass” comes the reply, meekly.
“OK, just need to fix that 5% then.” The scrum master announces before striding off to announce that the team is only a couple of % away from success.

Our tester takes a moment to try and process the exchange…


Firstly, their own words:
“We got a good run”
Why had they said that? Well - in a sense - it was true. They had executed the tests before, and they had returned a much higher failure rate. But the code being checked was the same...

OK, so there were at least 3 obvious ways to interpret the data.
  1. The app code meets the criteria checked by the tests. (Based on test run 2)
  2. The app code does not meet the criteria checked by the tests. (Based on test run 1)
  3. The tests are as reliable as the toss of a coin. (Based on both test runs)

It's surprising how unlikely people are to choose (3).


Secondly, the scrum master’s words:
“just need to fix that 5%”
Our tester assumes this relates to the de-facto “threshold” that is usually considered good enough to release, as if the results were on a linear scale, such as height or weight. If your code gets over 90%, it gets to pass the gate and get on the release roller-coaster.

The threshold tends to be arbitrary; I worked with a client who thought 86% was good but 83% was just not fit for purpose! The use of such thresholds tends to indicate a problem. Why do we care about a number rather than a possibly broken feature? What features or risks do the failing 10% represent? Why do we have so many routine failures?

Do you hear these sort of conversations in your team? If so, then your team might need some coaching.

Monday, 10 October 2016

Programmers & Testers, two roles divided by a common skill-set.

Switching people from programming to testing, and vice versa, may reduce the quality of our software.


I’ll get some quick objections out of the way:
  1. But, a person can be a great tester and programmer. - Yes, I agree.
  2. But, programmers do a lot of good testing. - Yes, I agree.
None of the above are in conflict with my conjecture.

Programming or writing software automates things that would be expensive or hard to do otherwise. Our software might also be faster or less error prone at doing whatever it replaces. It may achieve or enable something that couldn't be done before the software was employed.

Writing software involves overcoming and working around constraints to achieve improvement. That is, looking for ways to bypass limitations imposed by external factors or limitations in the tools we use. For example, coping with a high latency internet connection, legacy code or poor quality inputs. A programmer might say they were taking advantage of the technologies’ features to create a faster/more-stable system. A skilled and experienced programmer has learnt to deal with complexity and produce reliable and effective software. That's why we hire them.

A good example might be WhatsApp. Similar messaging systems existed before WhatsApp. But WhatsApp brought together the platform (mobile iOS & Android), cost (free at point of use), security (e2e encryption) and ease of use that people wanted. These features were tied together and the complexities and niggles were smoothed over. For example, skilled programmers automated address book integration and secure messaging instead of a user having to try and use multiple apps to achieve the same.

But these complexities or constraints are often leads to bugs - leads that are easy to not fully appreciate. The builder's approach ("it does that, so I need to do this") can override the more investigative approach of puzzling over what a system's apparent behaviour means about its underlying algorithm or supporting libraries. A good tester hypothesizes about what behaviours might be possible from the software. For example: could we get the app to do an unwanted thing, or the right thing at the wrong time?

Alternatively, a tester may observe the software in action, while bearing in mind that certain symptoms may be caused by constraints within the software or its construction. This psychological ‘priming’ can make them more likely to spot such issues during their examination of the app.

A common response at this point in the debate is, “The person writing the app/automated tests/etc would be able to read the code and see directly what the algorithm is!” But, that argument is flawed for 2 main reasons:

  1. Many testers can read and write good code. This sort of investigation is, and always has been, an option - whether we are currently implementing the app, an ‘automated test’, or neither. The argument is often a straw man, suggesting that all testers cannot write (and therefore can’t read) code.
  2. In a system of any reasonable complexity, there are situations where it’s easier to ascertain the system’s actual behaviour empirically. An attempt to judge its behaviour purely by examining the code is likely to miss a myriad of bugs caused by 3rd-party libraries, environmental issues and a mish-mash of different programmers’ work.

For example...

Programmers:


Programmers can - and do - test their own and their colleagues’ code. They do so as programmers, focused on those very difficult constraints. One of the key constraints is time. Not just “can the code handle time zones?” but “how long do I have to implement this? What can I achieve in that time? Can I refactor that dodgy library code, or do I just ‘treat it as good’?” Time and the other constraints guide the programmer down a different testing road. Their rationed time leads them to focus on what can be delivered. Their focus is on whether their code meets the criteria in the ticket, given the complexities they had to overcome.

A classic symptom of this is congruence bias: programmers implement code and tests that ascertain whether the system is functioning as they expect. That’s good. It can tell us the app can achieve what was intended. A good example might be random number generation. A team might be assigned to produce an API that provides a randomiser for other parts of the app. The team has been told the output needs to be securely random - that is, very random.

The team, knowing about such things, use their operating system’s built-in features to generate the numbers (for example, on Linux that might be /dev/random). Being belt-and-braces kind of people, they implement some unit tests that perform a statistical analysis of their function. These would likely pass with every build once the usual minor bugs had been fixed, and all would be good.
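
As a rough illustration of the kind of confirmatory unit test such a team might write, here is a minimal Python sketch. The generate_random_bytes wrapper is hypothetical, os.urandom merely stands in for whatever OS facility the team wrapped, and a real suite might use a proper chi-squared test rather than this crude tolerance check.

```python
import os
from collections import Counter

def generate_random_bytes(n: int) -> bytes:
    # Hypothetical wrapper around the OS generator the team chose;
    # os.urandom is used here purely for illustration.
    return os.urandom(n)

def test_byte_frequencies_are_roughly_uniform():
    sample = generate_random_bytes(256_000)
    counts = Counter(sample)
    expected = len(sample) / 256
    # Every byte value should show up in a sample this size...
    assert len(counts) == 256
    # ...and no value should be wildly over- or under-represented.
    # (Crude tolerance; a real check might use a chi-squared test.)
    assert all(abs(c - expected) < expected * 0.25 for c in counts.values())
```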

Testers:


Luckily for the above team, they had a tester, and that tester loved a challenge. Of course she checked the randomness of the system, and yes, that looked OK. She also checked the code in conjunction with other systems, and again it worked OK. She checked whether the code was fast enough, and once again the system was fine. The tester then set up a test system with a high load. Boom. The log was full of timeout errors, performance was now atrocious, and she knew she had struck gold. A little investigation showed that some operating system random number generators are ‘blocking’.

A blocking algorithm will cause subsequent requests to be queued (‘blocked’) until its predecessor has finished. Even if the algorithm is fast, there will be a tipping point when suddenly more requests are coming in than can be serviced. At that point the number of successful requests (per second, for example) will cease to keep up with demand. Typically we might expect a graph of the requests being effectively handled by our system to show a plateau at this point.
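
The tester’s load experiment can be sketched roughly like this (Python; get_random_token is a hypothetical stand-in for the randomiser API under test). The point is the shape of the numbers: with a non-blocking source, completed requests per second grow with concurrency; with a blocking one, they plateau or collapse.

```python
import os
import time
from concurrent.futures import ThreadPoolExecutor

def get_random_token() -> bytes:
    # Hypothetical stand-in for the randomiser API under test;
    # os.urandom is just a placeholder here.
    return os.urandom(16)

def requests_completed(duration: float) -> int:
    """One worker: call the API repeatedly for `duration` seconds."""
    deadline = time.monotonic() + duration
    count = 0
    while time.monotonic() < deadline:
        get_random_token()
        count += 1
    return count

def throughput(concurrency: int, duration: float = 2.0) -> float:
    """Completed requests per second at a given level of concurrency."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        futures = [pool.submit(requests_completed, duration)
                   for _ in range(concurrency)]
        return sum(f.result() for f in futures) / duration

# Watch how the rate changes as load rises; a plateau suggests blocking.
for load in (1, 4, 16, 64):
    print(f"concurrency={load:>3}  completed/s={throughput(load):,.0f}")
```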

Our tester had double-checked that the code could do the job, but also questioned the code in ways the team had not thought to. Given that there are tools and techniques to aid the measurement of randomness, this confirmatory step would likely be relatively short. A greater period of time would likely have been spent investigating [areas that, at the time, were] unknowns. Therefore, the question is less “can the tester read the code, or validate that it performs how we predicted?” The question is more: can they see what the software might have been? How might it fall short? What could or should we have built?

Our tester had a different mindset: she stepped beyond what was specified. We can all do this, but we get better the more we do it. We get better if we train at it. Being a good systems programmer, and training at that, comes at the cost of training in the tester mindset. Furthermore, the two mindsets are poles, each at an opposite end of our cognitive abilities. Striving at one skill set might not help with the other. A tester who writes code has great advantages. They can and do create test tools to tease out the actual behaviour of the system. These programming skills have an investigative focus. They may even have an exploratory or exploitative (think security testing) focus, but not a construction focus.

For those that are screaming BUT THEY COULD HAVE HAD A BDD SCENARIO FOR HIGH LOAD or similar, I’ll remind you of the Hindsight bias ( tl;dr:  “the inclination, after an event has occurred, to see the event as having been predictable”)

While the programmer and the tester often share the same headline skills - e.g. they can program in language X, and understand and utilise patterns Y & Z appropriately - they apply these skills differently, to different ends.

The change from tester to programmer is more than a context switch. It's a change in your whole approach. People can do this, but it has a cost. That cost might be paid in slower delivery, bugs missed, or features not implemented.

Tuesday, 2 August 2016

Synecdoche

A common but often unnoticed figure of speech is the synecdoche. When I say “Beijing opened its borders”, we know I mean “The People's Republic of China has opened its borders.” That’s a synecdoche: in this case I named part of something (Beijing) to mean the whole (the P.R.C.).

Conversely, I might say “Westminster is in turmoil” when anyone with knowledge of British politics will know I mean, “The politicians in the Houses of Parliament are in turmoil”. The reader will know I am not referring to The City of Westminster, a region of London. (Or the place in Canada etc.)

Synecdoche can be a useful and illustrative tool of conversation, helping to convey the size or importance of the subject, or to illustrate in more detail a subtlety of the situation. For example, “Beijing opened its borders” also indicates the power of that country's central government: some residents of one city in China can open [or close] the borders of a vast country spanning thousands of miles and comprising over 1.3 billion people.

Synecdoches can also lead to ambiguity, and are particularly dependent on context. For example, the same phrase “Westminster is in turmoil”, accompanied by a picture of a derailed train, smoke and ambulances, would lead the reader to assume the geographic region of Westminster was being referred to.

Just this sort of language and potential for confusion exists within software development. For example, a Product Owner might ask a team to code a feature for her App. A technical lead would likely know her team will actually: analyse, converse, script, code, test, fix, report, document, review etc. And probably do this across multiple systems before she can agree with the Product Owner that the App’s feature is complete or ‘coded’.

Why don’t technical leads get annoyed by this narrow description of the work? Well, actually they do, all the time. When working as a Scrum Master and Program Manager I frequently had to smooth these sorts of negotiations. Often a technical lead or test lead would take the Product Owner’s choice of word (e.g. “code” or “develop”) to mean that the work required was not significant, when the Product Owner’s words could have been translated as “do clever stuff to make it happen”.

Product Owners were often not from a programming or testing background. Occasionally they would not use the same jargon as developers or, more often, they used the same terms but with their own meanings. For example, using ‘code’ to mean the whole software development and release process.

While some friction would be caused in circumstances where someone might use the wrong or, to the team, misleading jargon, the team usually adapted. The team might use the jargon between themselves, but then adopt a less ‘technical’ (their words) language style when talking to others. That is, people outside the core team.

Testing also has situations where we frequently say one thing and rely on context to mean much more, or less. ‘Test automation’, for example. This simple term can cover a range of tools, techniques and even approaches.

In my experience, ‘Test automation’ has for example referred to test data generators or shell scripts. These would check data-outputs were within a valid range, given data-inputs of historical purchases. I have also worked with successful teams where the term test automation meant random input generators combined with a simple run-until-crash check.

Furthermore, I have worked on systems where ‘Test Automation’ results could be red/green/pass/fail-style messages reported from a GUI- or API-based test tool. In another team, our results could only have been usefully discerned with the aid of graphing software. On some projects the skilled expertise of a statistician was required to decide whether our test code had uncovered an issue. On occasion, the term 'Test Automation' could mean several or all of the above.

When talking with my team, I need to be more specific. I, like them, have to be able to describe what I’m doing and why. I could just say “I’m doing test automation”, but that would be like a developer stating “I’m doing feature X”. Having a precise way to describe my work, and how it relates to the work of my team members, is valuable and time-saving - not just in the time saved by not re-explaining and clarifying concepts but, more importantly, in not having to redo things we thought were complete or correct the first time.

Having the words to describe our work in detail is invaluable. The sorts of things we talk about within a team are jargon-heavy. E.g. I need to explain to my team that I’m coding a check for the product’s UTF-16 surrogate pair handling, to be added to the Continuous Integration process, and that this might mean we don’t complete a feature this sprint. I may need to clarify that I’m writing a script to be used as an oracle to aid our User Interface testing, or ask the programmers to include a testability hook to aid our log file analysis.

The language used to communicate these ideas is important, and the language and terms themselves are worthy of at least some discussion. If we as a team are unfamiliar with the terms, or with their differing contextual meanings, we will likely end up very confidently and quietly not knowing what we are doing all day.

Monday, 21 December 2015

Your software sucks (any data you give it)

At 1524h, on the afternoon of January 15th 2009, US Airways Flight 1549 was cleared for takeoff from Runway 4 at New York's La Guardia airport. The airplane carried 150 passengers and 5 flight crew, on a flight to Charlotte Douglas, North Carolina. The Airbus A320's twin CFM56 engines had been serviced just over a month prior to the flight. The plane climbed to a height of 859m (2,818 feet) before disaster struck.

Passengers reported hearing several loud bangs and then flames being visible from the engines' exhaust. Shortly thereafter the 2 engines shut-down, robbing the Airbus of thrust and its primary source of electrical power.

At this point the Captain took over from the First Officer, and between them they spent the next 3 minutes looking for somewhere to land while desperately trying to restart the aircraft's engines.

What Happened?

A flock of birds had crossed the path of the Airbus and several had struck the plane. Both engines had ingested birds and shut down as a result. A shut-down is the FAA-required minimum standard behaviour for a jet engine.

An Emirates engine after a bird strike.
The safe automatic shut-down of a jet engine is a scenario tested for by engine manufacturers before they can be certified for use.

Worse things might happen, e.g. the broken unbalanced blades might continue blowing air into the fuel rich combustion chamber while red-hot engine fragments are jettisoned outwards into other parts of the fuel-laden plane.

Viewed in that light: a graceful shut-down is not a bad minimum safety requirement.

If we think about it, jet engines need a good deal of testing; after all, they:
  • Are mission critical.
  • Work faster than humans can think and react.
  • Are expensive and time-consuming to build.
  • Have to be integrated with other complex systems.
  • Have to accept un-validated inputs (like birds).
Does any of that sound familiar? That last one in particular is relevant to the field of software development and testing.

Un-validated input? How do they test that?


One of the tests that can be performed on a Jet Engine is to fire frozen poultry into the engine. The engine ingests a turkey at high speed, in an attempt to simulate a bird being sucked into the engine during flight.

Like many technical systems that deal directly with the outside world, software can have serious problems when exposed to unusual inputs. Like the jet engine, the point of ingest literally cannot be protected - something has to 'process' whatever is coming in.

As software testers working with applications that need to handle these situations, we need to learn how to perform our own frozen-turkey tests and examine how our complex systems handle them (a minimal sketch follows the list below).
  • Do they crash?
  • If they crash, is that OK? 
  • What have I learned?
  • What were the side effects?
  • Can I restart it, or is it now 'corrupted'?
  • What is the likelihood of failure?
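
As a rough, hypothetical sketch of such a "frozen turkey" harness (Python; parse_search_query and the input list are invented for the example), the idea is simply to throw awkward inputs at the point of ingest and watch what happens:

```python
# All names and inputs here are invented for the example.
AWKWARD_INPUTS = [
    "",                        # nothing at all
    "a" * 1_000_000,           # far longer than any 'normal' input
    "\u202eright-to-left",     # right-to-left override character
    "\U0001F600\U0001F3B5",    # emoji (surrogate pairs in UTF-16)
    "null\x00byte",            # embedded NUL character
    "'; DROP TABLE users;--",  # injection-shaped text
]

def parse_search_query(text: str) -> str:
    """Placeholder for the real point of ingest."""
    return text.strip().lower()

for candidate in AWKWARD_INPUTS:
    try:
        parse_search_query(candidate)
    except Exception as exc:  # we *want* to see every failure, not hide it
        print(f"Crashed on {candidate!r:.60}: {exc!r}")
```
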
The sorts of websites we use every day have to accept largely un-validated inputs: they are on the internet, and they have to deal with anything our computer can send.

But surely it's just text, right?
If it's not 'normal', block it!

That isn’t going to work for long... For example, Google has to handle anything you want to find on any website - even if you accidentally include some right-to-left data in your search:




...Or you want to find out how to do that cool Emoticon on your new Microsoft Surface notebook keyboard... Microsoft.com then needs to handle that query.


...Or you don't want to pay extra on your phone bill just because you used a smiley face in your text message.

These are real-world examples of things people use their software for every day. Hence they are the sort of things we need to test for, lest our users end up going elsewhere or find they are being overcharged.

Tools such as No More ASCII can help us test websites, by giving us direct access to a range of Unicode 'code-points' that may cause problems for our software.

The problems can be subtle - more than just something 'not looking right'. The complex way in which languages are represented in your application can mean that simple things, such as measuring the length of a string, can fail. (string.online-toolz.com)
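
A small Python illustration of why "length" is slippery: the same emoji is one user-visible character, one Unicode code point, two UTF-16 code units and four UTF-8 bytes, so different layers of your application may disagree about how long the string is.

```python
text = "\U0001F600"  # a single emoji character

print(len(text))                           # 1 - Python 3 counts code points
print(len(text.encode("utf-16-le")) // 2)  # 2 - UTF-16 code units (a surrogate pair)
print(len(text.encode("utf-8")))           # 4 - UTF-8 bytes
```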




Sorting can also fail. If your text is reversed, for example, it may not render correctly afterwards:


These 2 issues are caused by the website not being able to properly process Unicode text, in particular the UTF-16 flavour of Unicode. In UTF-16, some characters are in fact made up of 2 parts, or 16-bit 'code units'. So whilst many characters are encoded as a single code unit, some need a pair. These pairs are referred to as 'surrogate pairs'.

Why does the reverse-string function fail? It appears to be putting the emoticon's 2 code units in reverse order, when it shouldn't. They should be treated together as one character when a reverse or sort is performed. (When the individual code units in a surrogate pair are swapped, they become meaningless.)

How to reverse a UTF-16 text string with a Surrogate Pair in it.
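
A minimal Python sketch of both the failure and the fix (not the code behind the website above): reversing raw UTF-16 code units splits a surrogate pair, while reversing whole characters keeps it intact.

```python
def reverse_utf16_code_units(s: str) -> str:
    """The buggy approach: treat each 16-bit UTF-16 code unit as a character."""
    data = s.encode("utf-16-le")
    units = [data[i:i + 2] for i in range(0, len(data), 2)]
    return b"".join(reversed(units)).decode("utf-16-le", errors="replace")

def reverse_whole_characters(s: str) -> str:
    """Python 3 strings are sequences of code points, so a plain slice
    never splits the two halves of a surrogate pair."""
    return s[::-1]

text = "hi\U0001F600"                   # 'h', 'i', then an emoji
print(reverse_utf16_code_units(text))   # the emoji is mangled into replacement characters
print(reverse_whole_characters(text))   # the emoji survives the reverse
```

(A fully general reverse would also need to keep grapheme clusters - for example accented characters built from combining marks, or flag emoji - together, which takes more than a simple slice.)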

These 'surrogate pairs' cover things like emoticons, musical notation etc. Such characters were not widely used on computers in North America in the 1960s, and are therefore not in ASCII, but they are now widely used all around the globe.

Un-validated text input is a great example of where tool-assisted testing can discover a wealth of knowledge about our applications. Given the wide domain of possible inputs and the unknown complexity of the app, this is an inherently exploratory process. Having the right tools on hand helps you gain that knowledge quicker.

You can read more about how to explore how your browser/app handles Unicode.

Thursday, 10 December 2015

Even the errors are broken!

An amused but slightly exasperated developer once turned to me and said "I not only have to get all the features correct, I have to get the errors correct too!". He was referring to the need to implement graceful and useful failure behaviour for his application.

Rather than present the customer or user with an error message or stack trace, give them a route to succeed in their goal - e.g. find the product they seek, or even buy it.
Bing Suggestions demonstrates ungraceful failure.

Graceful failure can take several forms; take a look at this Bing [search] Suggestions bug in Internet Explorer 11.

As you can see, the user is presented with a useful feature, most of the time. But should they paste a long URL into the location bar, they get hit with an error message.

There are multiple issues here. What else is allowing this to happen to the user? The user is presented with an error message - why? What could the user possibly do with it? Bing Suggestions does not fail gracefully.
I not only have to get all the features correct, I have to get the errors correct too!  -Developer
In this context, presenting the user with an error message is a bug - probably worse than the fact that the suggestions themselves don't work. If they failed silently, the number of users who were consciously affected would probably be greatly reduced.

By causing the software to fail, we often appear to be destructive, but again we are learning more about the application, through its failure. Handling failures gracefully is another feature of the software that is important to real users - in the real world. The user wants to use your product to achieve their goal. They don't want to see every warning light that displays in the pilot's cockpit. Just tell them if they need to put their seat-belt on.

Monday, 7 December 2015

Counting Images, a FireFox Add-on

Many of my clients ask me to test their content management and processing systems. Often this involves investigating how the software handles images of various sizes as well as text of various lengths or types.

To help create test-images, I created this little FireFox Add-on. The Counting Images add-on starts with one click and can be used to create an image of a custom size.

For example: if you need a 300x250 MPU advert image, just enter 300 and 250 into the panel and click Create Image. To download the image, just click on it - as you would a link - and choose Save.

The image files are named widthxheight.png, and include markings to help identify if they have been truncated e.g.:



The marked numbers refer to the size in pixels of the rectangle they are in. E.g.: the blue rectangle (always the outermost one) is 150x100 pixels in size.

Another example:
As you can see, the rectangles start at the defined size and count down in steps of 20 pixels.
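
The add-on itself is a Firefox extension, but the idea is easy to approximate. Here is a rough Python sketch using the Pillow imaging library (sizes, colours and fonts won't match the real add-on's output):

```python
# A rough approximation of the idea, not the add-on's actual code.
from PIL import Image, ImageDraw

def counting_image(width: int, height: int, step: int = 20) -> Image.Image:
    """Nested, labelled rectangles, each 20px smaller than the last,
    so any truncation of the image is visible at a glance."""
    img = Image.new("RGB", (width, height), "white")
    draw = ImageDraw.Draw(img)
    w, h, inset = width, height, 0
    while w > 0 and h > 0:
        draw.rectangle([inset, inset, inset + w - 1, inset + h - 1], outline="blue")
        draw.text((inset + 2, inset + 2), f"{w}x{h}", fill="black")
        w, h, inset = w - step, h - step, inset + step // 2
    return img

counting_image(300, 250).save("300x250.png")
```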

What could go wrong? Well, a good example is very thin and tall images. The image edge might actually truncate the text specifying its height, e.g.:


The image here is 30x1001, but the narrowness means the visible text reads 30x100.