Investigating Software

Wednesday, 3 October 2012

Simple test automation, with no moving parts.

Can you see the 74?
This is an Ishihara Colour Test. It's used to help diagnose colour blindness: people with certain forms of colour blindness cannot read the figure contained in the image. The full set of 38 plates allows a doctor to accurately diagnose the colour-perception deficiencies affecting a patient.

The test is ingenious in its concept, yet remarkably simple in its execution. No complicated lenses, lighting, tools or measuring devices are required. The doctor or nurse can quickly administer the test with a simple and portable pack of cards.

The Ishihara test is an end-to-end test. Anything from the lighting in the room to the brain of the patient can influence the result. The examiner will endeavour to minimise the controllable factors: switching off the disco lights, asking the patient to remove their blue-tinted sunglasses, and perhaps checking they can read normal cards (e.g. your patient might be a child).

End-to-end tests like this are messy. Many factors can be in play, making classic pre-scripted test automation of minimal use, as the burden of coding for the myriad of issues can be prohibitive. Furthermore, despite their underlying complexity, end-to-end tests are often the most valuable – they can tell you if your system can do what your customer is paying for. For example, are your data-entry inputs making it out to the web? Are they readable by your users?

These Ishihara-style tests are a quick way of analysing that end-to-end view. This is an area I have been looking into recently; here's an example of a Unicode-encoding detection file, as rendered by default in Firefox.

The fact that none of the text is legible tells us that the text is not being rendered in the common Unicode formats (known as UTF-8, UTF-16LE or UTF-16BE). This type of rendering problem is known as Mojibake. Depending on your context, that might be expected, as by default HTTP uses an older text-encoding standard (labelled ISO 8859-1, which is similar to ASCII).
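The effect is easy to reproduce. A minimal Ruby sketch (the sample text is my own, not the detection file itself) shows the same bytes becoming legible or garbled depending on which encoding the reader assumes:

```ruby
# Encode a known string as UTF-16BE, then read those bytes back
# under two different assumed encodings.
encoded = "Can you read this?".encode("UTF-16BE")

as_utf16  = encoded.dup.force_encoding("UTF-16BE").encode("UTF-8")
as_latin1 = encoded.dup.force_encoding("ISO-8859-1").encode("UTF-8")

puts as_utf16   # legible: "Can you read this?"
puts as_latin1  # Mojibake: every byte, including the NULs, becomes a character
```

The bytes never change; only the reader's assumption about them does – which is exactly what the browser's encoding menu controls.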

You can actually change how Firefox and Internet Explorer 'decode' the text for a page. These are the menus to do it in Firefox on Windows 7.

If I change Firefox to use the menu option "Unicode (UTF-16)" character encoding, this is what I see:

Notice the page tells me it is being rendered in UTF-16BE. Our special page has reverse-engineered what the Firefox browser means by UTF-16. There are in fact two types of UTF-16: BE and LE (if you are interested, you can find out more about this Big Endian / Little Endian quirk). That's interesting – why did it use UTF-16BE? Is it using the default ordering of UTF-16's predecessor, UCS-2, which was Big-Endian?
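The difference between the two is simply byte order, and a quick Ruby check makes it concrete:

```ruby
# The same character, encoded both ways: only the byte order differs.
char = "A"  # U+0041

be = char.encode("UTF-16BE").bytes  # => [0x00, 0x41]  most significant byte first
le = char.encode("UTF-16LE").bytes  # => [0x41, 0x00]  least significant byte first
```

A reader that guesses the wrong byte order decodes entirely different characters, which is why the detection page can tell the two apart.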

(Don’t worry this stuff IS ACTUALLY CONFUSING.)

If I change Firefox to use what is fast becoming the de facto standard, UTF-8, the page tells us likewise:

I could do other similar investigations by checking HTTP headers. I might also examine the page source and the encoding that has been configured there. But alas, it's not uncommon for these to differ for a given page. So how do you find out which encoding is actually being used? The Ishihara tests can help.
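For reference, those two declared encodings are easy to pull out programmatically. A small Ruby sketch (the header and markup here are made-up examples) shows how the two declarations can disagree:

```ruby
# A page's encoding can be declared in (at least) two places:
header = 'text/html; charset=ISO-8859-1'                                   # HTTP Content-Type header
html   = '<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">'  # page source

header_charset = header[/charset=([\w-]+)/i, 1]  # => "ISO-8859-1"
meta_charset   = html[/charset=([\w-]+)/i, 1]    # => "UTF-8"

# Neither declaration proves what encoding the bytes actually use -
# which is exactly the gap the Ishihara-style file fills.
```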

Unlike other methods, very little setup is required; the files just need to be included in the test system or its data. They are safe and simple – they don't execute any code at run time and are not prone to many of the usual programming-related maintenance issues.

When might you use Ishihara-style tests? Whenever you suspect there is some medium that might be interfering with what you are seeing. For example, if you deploy a new cache in front of your website, it shouldn't change how the pages are actually encoded [should it?]. (Changes in encoding might change a page's appearance – now you have a quick way to check the actual encoding in use.)

Remember that end-to-end view? Well if our system has multiple steps - which process or affect our text - then any one of those steps might in theory highlight an issue. So even if viewing our test file suggests it is being treated as UTF-8, this might just mean that for example our back-end content management system processed the file as UTF-8. The next step may have again changed the data to a different encoding. So while we can't always be sure what is affecting the Ishihara test text, we can at least see that something in that black box is affecting it in a visible way.

I've only scratched the surface here with the idea of Ishihara tests. They can provide greater resolution on issues such as character/text encoding, e.g. did that euro symbol display OK? Then we know you are not using ASCII text encoding, and so on. The technique can be used elsewhere – have a try yourself. You can download the simple example above.

Monday, 10 September 2012

Cincinnati Test Store

Monday 3rd September 1827: a man steps off the road at the corner of Fifth and Elm, and walks into a store. He's frequented the store a few times since it opened, and he's starting to get to know the owner and his range of merchandise. In fact, like many people in town, he's becoming a regular customer.

He steps up to the counter; both he and the store owner glance at the large clock hanging on the wall and nod in unison. The shopkeeper makes a note of the time, and the two then begin a rapid discussion of requirements and how the shopkeeper might be able to help. When they've agreed what's needed, the shopkeeper prepares the various items, bringing them to the counter weighed, measured and packaged, ready for transport to the customer's nearby holding.

The shopkeeper then presents the bill to the customer, who glances at the clock again, and at the prices listed on each of the items arranged around the store's shelves, and then pays. The customer smiles as he loads the goods onto his horse, happy that he's got a good deal and yet been able to talk over his needs with the shopkeeper – for the items he knew least about. He also appreciated how his purchases were packed securely. As he was travelling back home that day, the extra cost of packing the goods was worth it, given the rough ride they'd likely take on the journey.

The store was the Cincinnati Time Store, and the shopkeeper was Josiah Warren. The store was novel in that it charged customers the base 'cost' of the items on sale, plus a cost for the labour-time involved in stocking the item and serving the customer. The shopkeeper might also charge a higher rate for work he considered harder. The store was able to undercut other local stores, and increase the amount of business it transacted.

Imagine if software testing was bought and sold in this manner. Many successful software testers here in London are contractors, and already work short contracts as and when is agreeable to both parties. But even then, the time is usually agreed upfront, e.g. 3 months. Imagine if that time was available on demand, per hour?

What drivers would this put onto our work? And that of other team members?

You might want constant involvement from your testers, in which case the costs are fairly predictable. But remember, you are paying by the hour: you can stop paying for the testing at the end of each hour. Would you keep the tester being paid for the whole day? Week? Sprint? Even if they were not finding any useful information? If you found that pairing your testers with programmers full-time was not helping, you could save some money on the pure-programming parts of your plan. Conversely, your tester would be motivated to show they could pair and be productive – if they wanted to diversify their skills.

As the tester, I'm now financially motivated to keep finding new information. To keep those questions, success stories and bug reports coming. I'm only as good as my last report. If the product owner thinks she's heard enough and wants to ship – then she can stop the costs at any time, and ship.

The team might also want to hire a couple of testers, rather than just one. The testers might then be directly competing for 'renewal' at the end of the hour. I might advertise myself as a fast tester (or rapid tester) and sell my hours at a higher rate. I might do this because I've learned that my customer cares more about timeliness than cost per hour. For example, the opportunity cost of not shipping the product 'soon' might be far greater than the cost of the team members. I'd then be motivated to deliver information more quickly and usefully than my cheaper-but-slower counterpart. My higher rate could help me earn the same income in less time, and help the team deliver more, sooner.

Has your team been bitten by test automation systems that took weeks or longer to 'arrive'? And maybe then didn't quite do what you needed, or were flaky? If you were being paid by the hour, you would want to deliver the test automation – or, more usefully, the results it provides – in a more timely manner. You'd be immediately financially motivated to deliver actual test results, information or bug reports, incrementally, from your test automation. If you delivered automation that didn't help you provide more and better information each hour, how would you justify that premium hourly rate? What's more agile than breaking my test automation development work into a continuous stream of value-adding deliverables that will constantly be helping us test better and quicker?

Paying for testing by the hour would not necessarily lead to the unfortunate consequences people imagine when competition is used in the workplace. My fellow tester and I could split the work, maximising our ability to do the best testing we can. If my skills were better suited to testing the application's Java & Unix back-end, I'd spend my hour there. Meanwhile, my colleague uses their expertise in GUI testing and usability to locate and investigate an array of front-end issues.

Unfortunately, a tester might also be motivated to drag out testing and drip-feed information back to the team. That's a risk. But a second or third tester in the team could help provide a competitive incentive, especially if those fellow testers were providing better feedback, earlier. Why keep paying Mr SlowNSteady when Miss BigNewsFirst has found the major issues after a couple of hours' work?

I might also be tempted to turn every meeting into a job justification speech. Product Owners would need to monitor whether this was getting out of hand - and becoming more than just sharing information.

I'm not suggesting this as a panacea for all the ills of software development, or even testing in particular. What this kind of thinking does is let you examine what the companies that hire testers want from testers. What are the customers willing to pay for? What are they willing to pay more for? From my experience, in recent contexts, customers want good information about their new software and they want it quickly – so the system can be fixed and/or released quickly.

Monday, 14 May 2012

Using test automation to help me test, a Google Elevation API example

Someone once asked me if testing a login process was a good thing to 'automate'. We discussed the actual testing and checking they were concerned with. Their real concern was that their product's 'login' feature was a fundamental requirement: if that was 'broken' they wanted the team to know quickly, and to get it fixed quicker. A failure to log in was probably going to be a show-stopping defect in the product. Another hope was that they could 'liberate' the testers from testing this functionality laboriously in every build/release etc.

At this point the context becomes relevant; the answers can change depending on the team, company and application involved. We have an idea of what the team are thinking – we need to think about why they have those ideas. For example, do we host or own the login/authentication service? If not, how much value is there in testing the actual login process? Would a mock of that service suffice for our automated checks?

What are we looking for in our automated checks? To see it work? For one user? One user at a time? One type of user at a time? I assume we need to check the inverse of these as well, i.e. does it reject a login from an unacceptable user? Otherwise we could easily miss one of the most important tests – do we actually allow and disallow user logins correctly?

These questions soon start to highlight the point at which automation can help and complement testing. That is to say, test automation probably wouldn't be a good idea for testing a single user login, but would probably be a good idea for testing 100 or 1000 logins, or types of login. Your testers will probably have to log in to use the system themselves, so will inevitably use and eyeball the login process from a single-user perspective. They are unlikely to have the time, or patience, to test a matrix of 1000 user logins and permissions. Furthermore, the login service could take advantage of the features automation can bring. For example, the login service could be accessed directly and the login API called in whatever manner the tester desires (sequential, parallel, duplicates, fast, slow, random etc.). These tests could not practically be performed by one person, and yet are likely to be realistic usage scenarios.
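As a sketch of what that matrix-style check might look like – the endpoint, credentials and expectations below are all hypothetical – the automation boils down to driving the login API for many users and flagging any user whose outcome differs from what it should be:

```ruby
require 'net/http'
require 'uri'

# Try one login against a (hypothetical) HTTP endpoint; true if accepted.
def login_accepted?(base_url, user, password)
  res = Net::HTTP.post_form(URI("#{base_url}/login"),
                            'user' => user, 'password' => password)
  res.code == '200'
end

# Compare actual outcomes against what the matrix says should happen,
# returning the users whose result was wrong.
def login_mismatches(matrix)
  matrix.reject { |row| row[:expected] == row[:actual] }
        .map    { |row| row[:user] }
end

# Example matrix; the :actual values would come from login_accepted? calls.
matrix = [
  { user: 'admin',    expected: true,  actual: true  },
  { user: 'disabled', expected: false, actual: false },
  { user: 'expired',  expected: false, actual: true  },  # let in when they shouldn't be
]
```

The same loop scales from one user to a thousand, and the driving code can just as easily fire the requests in parallel or at random.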

An investigation using reasoning and test automation such as this, playing to the computer's strengths, can have the desired knock-on effect of liberating the tester, and can even provide them with intelligence [information] to aid finding out more information or bugs. The questioning about what they want, what they need, and what they are working with, all sprang from their desire to find out about a specific application of test automation.

For example, I recently practised some exploratory test automation on the Google Maps API, in particular the Elevation API. The service, in exchange for a latitude and longitude, returns an elevation in metres. The API is designed for use in conjunction with the other Google Maps APIs, but can be used directly, without login, via a simple URL. If we had to test this system, maybe as a potential customer, or if I was working with the developers, how might we do that? How might test automation help?

I start by skim-reading the documentation page, just as much as I need to get started. Firstly, as a tester I can immediately bring some issues to light. I can see the page does not provide an obvious indication of what it means by 'elevation'. Is that elevation above sea level? If so, does it refer to height above Mean High Water Spring, as is typical for things such as bridges over the sea or river estuaries? Or is it referring to the height above 'chart datum', a somewhat contrived estimate of a mean low tide? I make a note: these questions might well be important to our team – but are not instantly answerable.

There's more information on nautical charts.

The documentation also doesn't readily indicate what survey the data is based on (WGS84, OSGB36 etc.). While this won't cause you much concern for plotting the location and elevation of your local pizza delivery guy, it might cause concern if you are using the system for anything business critical. For example, the two systems mentioned, WGS84 and OSGB36, can map the same co-ordinates to locations 70 metres apart. Again, context questions are arising. Who'd use this system? If you are hill walking in England or Scotland, the latter is likely to be the system used by your Ordnance Survey maps. But your handheld GPS system is likely to default to the American GPS convention of WGS84. Again, important questions for our team: what will the information be used with? By whom? Will it be meaningful and accurate when used with other data?

Starting to use the API is, as with most software, one of the best ways to find out how it does and does not work. I could easily check whether a single request will deliver a response, with a command like this:

curl -s ',1&sensor=false'

I tried a few points, checking the sorts of responses I received. The responses are JSON by default, indented for readability, and the precision of co-ordinates and elevation runs to several decimal places. There again, more questions... Does it need to be human readable? Should we save on bandwidth by leaving out the whitespace? Should the elevation be given to 14 decimal places? Here is an example response:

   "results" : [
         "elevation" : 39.87668991088867,
         "location" : {
            "lat" : 50.67643799459280,
            "lng" : -1.235103116128651
         "resolution" : 610.8129272460938
   "status" : "OK"

Were the responses typical? To get a bigger sample of data, I decided to request a series of points across a large area. I chose the Isle of Wight, an area to the south of England that includes areas above and below sea level and is well charted. If I saw any strange results, I should be able to get a reference map to compare the data against reasonably easily. I also chose to request the points at random, rather than sequentially. This would allow me to get an overall impression of the elevations from a smaller sample. It would also help to mitigate any bias I might have in choosing latitude or longitude values. I used Ruby's built-in rand method to generate the numbers. While not truly random, or as random as those from a true random-number source, they are likely to be considerably more random than any I might choose myself.

I quickly wrote a simple Unix shell script to request single elevation points for a pair of co-ordinates. The points would be chosen at random within the bounds decided (the Isle of Wight and surrounds). The script would continue requesting continuously, pausing slightly between each request to avoid overloading the server and being blocked. The results are each directed to a numbered file. A simple script like this can be quickly written in shell, Ruby or similar, and left to work in our absence. Its simplicity means maintenance and upfront costs are kept to a minimum. No days or weeks of 'test framework' development or reworking. My script was less than a dozen lines long and was written in minutes.
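A Ruby equivalent of that script might look something like this – the bounding-box values and output file naming are my own assumptions, and the endpoint is the documented URL form of the Elevation API at the time:

```ruby
require 'net/http'
require 'uri'

# Rough bounding box around the Isle of Wight (values are approximate).
LAT_RANGE = 50.57..50.77
LNG_RANGE = -1.59..-1.07

# Pick a random co-ordinate pair within the bounds.
def random_point
  [rand(LAT_RANGE).round(6), rand(LNG_RANGE).round(6)]
end

# Request one elevation and return the raw JSON response body.
def fetch_elevation(lat, lng)
  Net::HTTP.get(URI("http://maps.googleapis.com/maps/api/elevation/json" \
                    "?locations=#{lat},#{lng}&sensor=false"))
end

# Left to run unattended: save each response to a numbered file,
# pausing between requests so as not to overload the server.
def poll(count)
  count.times do |i|
    lat, lng = random_point
    File.write("response-#{i}.json", fetch_elevation(lat, lng))
    sleep 1
  end
end
```

Keeping every raw response on disk, rather than a pass/fail verdict, is what makes the later re-analysis possible.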

Left to run in the background, while I focused on other work, the script silently did the mundane work we are not good at, but computers excel at. Using the results of these API requests I hoped to chart the results, and maybe spot some anomalies or erroneous data. I thought they might be easier to 'notice' if presented in graphical form.

Several hours later, I examined the results. This is where Unix commands become particularly useful: I can easily 'grep' every file (in a folder now full of several thousand responses) for any lines of text that contain a given string. I looked at the last few responses from the Elevation API, and noticed that the server had stopped serving results, as I had exceeded the query limit. That is, I had requested more elevation values than are allowed under the service's terms of service. That's useful information: I can check whether the server started doing this after the right period of time – and how it calculates that. I now have more questions, and even some actual real data I can re-analyse to help.

Often test automation ignores most of the useful information, and is reduced to a simple pass/fail on one or a handful of pre-defined checks. But if you keep all the data, you can re-process it at any time. I tend to dump it all to individual files, log files or even a database. Then you can often restart analysing the system using the recorded data very quickly, and test your ideas against the real system.

In our Google Elevation API example, using grep, I quickly scanned every file to see all the results that were accepted. The command looked like this:

grep "status" * | grep  -v OVER_QUERY_LIMIT

In half a second the command has searched through over 12 thousand results and presented me with the names of the files and the actual lines that include the 'status' response. A quick scroll through the results and a blink test highlights that there is in fact another type of result. As well as those that exceeded the query limit and those that were OK, there is a third group that returns an UNKNOWN_ERROR. Another quick scan of the documentation shows that this is one of the expected response statuses for the API. I quickly retried the few requests that failed, using the same latitude and longitude values – they worked, and returned seemingly valid data. This suggests that these failures were intermittent. The failures indicated a useful point: the system can fail, and unpredictably.

More questions... How reliable is the system? Is it reliable enough for my client?

A quick calculation, based on the number of requests and failures, showed that although I had only seen a few failures, that was enough to take the availability of the service from 100% down to just under 99.98%. That's often considered good, but if, for example, my client was paying for 4 nines (99.99%), they'd want to know – if only to give them a useful negotiation point in contract renewals. I re-ran this test and analysis later and saw a very similar result; it appears this might be a useful estimate of the service's availability.
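The arithmetic itself is trivial – the counts below are illustrative, chosen to land just under that figure rather than taken from my actual run:

```ruby
# Hypothetical counts: ~12,000 requests with a handful of failures.
total    = 12_000
failures = 3

availability = 100.0 * (total - failures) / total
puts format('%.3f%%', availability)  # 99.975% - already below a 99.99% ("4 nines") target
```

Three failures in twelve thousand requests sounds negligible, yet it is enough to miss a 4-nines service level.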

Using the data I had collected, I wrote a short Ruby script that read the JSON responses and output a single CSV file containing the latitude, longitude and elevation. Ruby is my preference over shell for this type of task, as it has built-in libraries that make the interpretation of data in XML, JSON and YAML form almost trivial. I then fed these results into GNUPlot, a simple and free chart-plotting tool. GNUPlot allowed me to easily change the colour of plotted points depending on whether the elevation was positive or negative.
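That conversion script might look roughly like this – the file naming and output path are assumptions on my part:

```ruby
require 'json'
require 'csv'

# Read every saved JSON response matching the glob and write one CSV row
# per elevation result, skipping responses that were not "OK".
def responses_to_csv(glob, out_path)
  CSV.open(out_path, 'w') do |csv|
    csv << %w[lat lng elevation]
    Dir.glob(glob).each do |path|
      data = JSON.parse(File.read(path))
      next unless data['status'] == 'OK'
      data['results'].each do |r|
        csv << [r['location']['lat'], r['location']['lng'], r['elevation']]
      end
    end
  end
end

responses_to_csv('response-*.json', 'elevations.csv')
```

Note how the OVER_QUERY_LIMIT and UNKNOWN_ERROR responses are filtered out here, but remain on disk for any later analysis.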

Here's the result:

You can see the outline of the Isle, and even what I suspected were a couple of erroneous data points. Closer examination suggests that these are in fact likely to be correct, as they correspond to channels and bays that are open to the sea. Although this exercise had yet to highlight any issues, it performed a useful function nonetheless. It let me compare my results against another map visually, checking that I was grabbing and plotting the data at least superficially correctly. I had not, for example, confused latitude with longitude.

I did notice one thing that was not expected in the resulting map. The cloud of points seemed to lack any obvious distortion compared with other maps I found online. It seemed too good, especially as I had not used any correction for the map projection. I had taken the 3-dimensional latitude and longitude values and 'flat' projected them – and the result still looked OK.

This illustrates how testing is not so much about finding bugs, but rather about finding information and asking questions. We then use that information to help find more information through more testing. I now suspected the data was set to use a projection that worked well at European latitudes, e.g. Mercator, or used some other system to make things look 'right' to me. How might this manifest itself elsewhere in the API's responses? (Google's documentation has more info on the projections used etc.)

Thinking back to the 3-dimensional nature of the data, I knew that a point on the globe can be represented by multiple sets of co-ordinates [if we use latitude & longitude]. A good example is the North Pole. This has a latitude of 90 degrees, but can have any valid longitude. I tried various co-ordinates for the North Pole, and each returned a different elevation. That's interesting: my client might be planning to use the system for fairly northern latitudes – will the data be accurate enough? If elevation is unreliable around the pole, at what latitude will it be 'good enough'? What if our product owners want more information about just how variable the elevation at the pole is? Or what the elevation at the South Pole is? Those are pretty simple modifications to my short script. (Wikipedia has some interesting comments about Google Maps near the poles.)
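One such modification is simply to hold the latitude at 90 and vary the longitude. The sketch below builds the requests (the longitude sample is arbitrary, and the fetching line is commented out so the code can be read without hitting the live service):

```ruby
require 'net/http'
require 'uri'

# Every one of these URLs names the same physical point: the North Pole.
def pole_urls(longitudes)
  longitudes.map do |lng|
    "http://maps.googleapis.com/maps/api/elevation/json" \
      "?locations=90,#{lng}&sensor=false"
  end
end

urls = pole_urls([0, 45, 90, 180, -90])
# urls.each { |u| puts Net::HTTP.get(URI(u)) }  # each request may return a different elevation
```

If the service were internally consistent, every one of these requests would report the same elevation.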

The simple automation used in this example, combined with human interpretation, used relatively little expensive 'human' time, and yet maximised the return from automation. Many 'automation solutions' are quite the reverse: requiring extensive development, maintenance and babysitting. They typically require specialised environments and machine estates to be created and maintained by [expensive] people. This is far from actually being automated; the man-hours required to keep them running, to interpret the results and to rerun the ambiguous failures, are often substantial.

The exploratory investigation outlined here greatly improves on the coverage a lone human tester can achieve, and yet is lightweight and simple. The scripts are short and easily understood by testers new to the team. They are written in commonly used languages and can be understood quickly by programmers and system administrators alike. My client won't be locked into using "the only guy that can keep those tests running!" and they can free their staff to work on the product - the product that makes money.

Monday, 19 March 2012

A simple test of time.

Last week I was performing another of my 5-minute testing exercises. As posted before, if I get a spare few minutes, I pick something and investigate. This time, I'd picked Google Calendar.

One thing people use calendars for is logging what they have done. That is, they function as both schedulers and record keepers. You add what you planned to do, and they also serve as a record of what you did - useful for invoicing clients or just reviewing how you used your time.

Calendars, and software based on them, are inherently difficult to program, and as such are often a rich source of bugs. People make a lot of assumptions about time and dates. For example, that something ends after it starts.

That may sound like something that 'just is true', but there are a number of reasons why that might not be the case. Some examples are:
  • You type in the dates the wrong way round (or mix up your ISO and US dates etc)
  • You're working with times around a DST switch, when 30min after 0130h might be 0100h.
  • The system clock decides to correct itself, abruptly, in the middle of an action (A poorly implemented NTP setup could do this)
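A sketch of the guard an application needs here – parsing two user-entered timestamps and rejecting an event that ends before it starts (the date format is just an example):

```ruby
require 'time'

# Returns false for an event that ends before it starts,
# or for dates that cannot be parsed at all.
def valid_event?(start_str, end_str)
  Time.parse(end_str) >= Time.parse(start_str)
rescue ArgumentError
  false
end

valid_event?('2012-03-19 10:00', '2012-03-19 11:00')  # a normal event
valid_event?('2012-03-19 10:00', '2012-03-19 09:00')  # ends before it starts
```

Note that even this simple check sidesteps the DST case above: a wall-clock comparison of the raw strings would get that one wrong.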
Google Calendar is widely used, and has been available for some time, but I suspected bugs could still be uncovered quickly.

I opened Google Calendar, picked a time that day, and added an item: "Stuff i did". You can see it above in light blue.

I then clicked on the item, and edited the date. But, butter-fingers here, I typed in the wrong year. Not only that, I typed only the year in. So now we get to see how Google Calendar handles an event ending before it begins.

Google Calendar appears to have deleted the date. OK, maybe it's just deleting what [it assumes] is obviously wrong. But why the hourglass? What was Google's code doing for so long?

A few moments later, after not being able to click on anything else in Google Calendar, I'm greeted with this:

OK, so if I click yes, that's good, right? Otherwise won't I be disabling the Calendar code? A few moments later... the window goes blank...

A little later, the page reappears and you get another chance, and the Calendar starts to give you better warnings. But nonetheless, that wasn't a good user experience – and certainly a bug.

These are simple-to-catch bugs, so I'm often left wondering why they are so often present in widely used software that probably had considerable money expended on its development. This bug is quite repeatable, and present across different browsers and operating systems. All it took was a little investigation.

Tuesday, 6 March 2012

How to avoid testing in circles.

I once had an interesting conversation with a colleague who worked for a company selling hotel room bookings. The problem was interesting. Their profits depended on many factors. Firstly, fluctuating demand, e.g. holidays, weekends, local events etc. Secondly, varying types of demand, e.g. business customers, tourists, single-night bookings or 11-day holidays. They also had multiple types of contract on the rooms. For some, they might have had the exclusive right to sell [as they had pre-paid]; for others, they had an option to sell [at a lower profit] etc.

My naive view had been that they priced the room bookings at a suitable mark-up, upping that mark-up for known busy times etc. For example, a tourist hotel near the Olympics would carry a high mark-up, while the same tourist hotel room in winter would incur less of a mark-up (better to get some money than none at all).

He smiled and said some places do that, but he didn't. He had realised his team had a bias towards making a healthy profit. "That seems, errh, good..." I replied, not sure what I was missing. He explained the problem wasn't making a profit. They made a profit; they could do that. There is good enough demand, and limited enough supply, in a business and tourist centre like London to make a profit. The problem was maximising profit. What he had done was present the profit to the team as a proportion of the theoretical maximum profit. That is, the profit given the perfect combination of bookings at peak rates.

It was understood that this was an unlikely, but doable, goal. The benefit was that the team could more easily see whether they were making as much money as they could. For example, that business hotel room-stock was making £1 million profit (more than the others), but we should be able to get £10 million (whereas the others are already at 90% of their theoretical maximum profit).

This struck me as a useful way of looking at the world. Maybe it was just in tune with my tester mind-set. In software development, we often try to view what we have achieved, and we see the stories we have completed. And when we come to a release at the end of the week, sprint, month, quarter etc., we test those stories. We also fix and test the bugs considered important, and we usually regression test the system as a whole. At this point, most teams I've worked with start trying to regression test – but soon end up retesting the same areas, going in circles around the same code changes.

The problem is often that our view of the system has been primed. We fall foul of the anchoring bias, and cannot easily see that 90% of the system has not been examined. Our testing returns again and again to checking the recent changes and their surrounds, even when these are probably the best and most recently tested parts of the system. Much like the person in the audience called to the stage in a magic show – I've been primed to "pick a number, any number, could be 5, could be 11... any number you like". I'm unlikely to suggest any negative or very large numbers by the time I reach the stage.

What I've found useful is applying the same concept the hotel sales team used. To help reduce that bias, I invert the game. Instead of looking at the SCRUM/KANBAN board or bug tracking system that lists the known stories or known defects, I look at a different list, or even several lists. These are usually either a checklist of system areas, or something deliberately not in the affected system areas. I then pick an item and investigate its behaviour as if it were new. The very fact I'm not repeating the same ground as everyone else increases the chances that I will find an issue. What's better, these are just the sort of not-quite-related but somehow indirectly-broken bugs that regression testing is aimed at.

So rather than having a board of defects and stories that you are itching to remove, instead have a board with a card for each section of your software. Divide up your time before the release and start working through the items. You can prioritise your testing however you like, but remember: you have already focused a lot of time on certain areas, in specific release-change-related testing. What's left are the unexplored areas, the undiscovered bugs.

Thursday, 1 March 2012

Manual means using your hands (and your head)

I recently purchased a Samsung Galaxy Tab and an iPad 2. Unlike many of my previous gadget purchases, these new gadgets have become very much part of the way I now work and play. One thing I like about them is their tactile nature. You have a real sense that there is less of a barrier between you and what you want to do. If you want to do something, you touch it, and it 'just' does it. I don't have to look at a different device, click a couple of keys or move a box on a string to get access to what I can see right in front of me.

Features such as haptic feedback give a greater feeling that you are actually working with a tool, rather than herding unresponsive 'icons' or typing magic incantations into a typing device originally conceived 300 years ago.

The underlying software system used in these devices is a UNIX variant, just like the computer systems that underpin the majority of real-world systems, from the internet to a developer's shiny Apple Mac or Linux workstation. UNIX was initially developed over 40 years ago, and while it has been rewritten, ported and improved over the years, as far as these devices are concerned it's the stable platform upon which the magic happens.

Part of that magic is variously called the 'interface', the GUI or, more vaguely, the 'experience'. Devices with improved interfaces have been appearing for many years: the introduction of the command line itself, a little more friendly than punch cards; graphical menus and keyboard shortcuts; the windowing system, making managing those command line terminals and applications a little easier, especially if you invested in a box on a string. Affordable, portable and powerful computers with touch screens, and software that uses these features, are just the latest example.

From a testing perspective these are exciting new areas in which to expand our skills, and of course challenges to overcome. We get to learn about these tools and toys and how people use them. We also need to grasp how they work, and what they can't do. Yet as I mentioned, the underlying technology is conveniently similar. They are still UNIX; they probably speak to other computers using TCP/IP just like your desktop computer does. Much of the server communication under the hood probably uses HTTP, just like your Gmail.

As the 'interface' gets more human-oriented, more like the other 'real' tools in our lives, these devices get easier to use. But this of course means they are further removed from what computers themselves are able to work with. My computer doesn't really have a concept of what 'touch' is, just a way of handling touch events. It can't sense a slightly clunky window drag and drop. When we write software to try and test a window drag and drop, we can make it reliably apply the correct events in the right sequence, and check that another sequence of events and actions took place. Beyond that we have little knowledge about what is going on as far as our user is concerned.

We often kid ourselves that we are testing, for example, drag and drop, but in reality we are checking that a sequence of events happened, and was received and processed appropriately. That's fine, and probably a good idea, but it's not actually seeing how well drag and drop works. It might not 'work' at all for our user.

I once worked on a project that included a complex web navigation menu system. Multiple configurations were possible, and depending on the user's context various menu configurations and stylings would be displayed. A great deal of effort had been spent on test automation to 'test' this menuing system. Yet shortly before a release the CTO took a look at the system and noticed the menu was missing; he was not impressed. A recent code 'fix' had inadvertently altered the page layout and the menu was entirely invisible. The test automation was oblivious to this; even if it had checked the menu's 'visibility' setting, it would not have detected the problem. The menu was set to 'visible', but unfortunately another component had been placed on top, obscuring the menu.

This situation is a classic GUI-test-automation mistake. It highlights the common problem with test automation that tries to 'play human'. The test automation couldn't see the page and its missing menu. Why do we keep trying to get our computers to do this? The final arbiter of whether a feature is visible is the user's brain, not Selenium's object model.
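The obscured-menu failure is easy to model. Here is a toy sketch, entirely hypothetical, of why a 'visible' flag check passes while the user sees nothing: a second, higher z-order component can completely cover the menu, so the check has to consider what is drawn on top, not just the menu's own state.

```python
# Toy model of the obscured-menu bug: a widget's 'visible' flag is not
# the same as the user actually being able to see it.
from dataclasses import dataclass

@dataclass
class Component:
    name: str
    x: int
    y: int
    w: int
    h: int
    z: int                 # higher z is drawn on top
    visible: bool = True

def covers(a, b):
    """True if component a's rectangle completely covers component b's."""
    return (a.x <= b.x and a.y <= b.y and
            a.x + a.w >= b.x + b.w and a.y + a.h >= b.y + b.h)

def user_can_see(target, components):
    """The naive flag check, plus a check that nothing sits on top."""
    if not target.visible:
        return False
    return not any(c.visible and c.z > target.z and covers(c, target)
                   for c in components if c is not target)

menu = Component("menu", 0, 0, 200, 40, z=1)
banner = Component("banner", 0, 0, 800, 60, z=2)  # the misplaced component

print(menu.visible)                        # → True  (the check the automation made)
print(user_can_see(menu, [menu, banner]))  # → False (what the CTO saw)
```

In a real web page a comparable check is to ask the browser which element is actually topmost at the menu's coordinates (for example via `document.elementFromPoint`), rather than reading the menu's own visibility style. Even then, a human eye remains the final arbiter.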

I recently took a look at the Google Docs app on my new Galaxy Tab. I created a new blank spreadsheet and just started to click on the cells to enter some numbers. I use spreadsheets on Google Docs frequently, and felt that testing whether I am able to do this on my tablet would be worthwhile. My expectations were high: I'm using the Google Docs app, on [Google] Android software, on market-leading Android-supporting hardware. I click on a cell in the spreadsheet, and I feel the haptic response; I know the tablet knows I've done something. I see the cell become editable.

I wanted to see the range of menu options available for the cell. As is normal for touch screens and some desktop software, I press and hold on the cell (that's pretty much the equivalent of the right mouse button). Oops. The whole spreadsheet disappears...

I try various similar manoeuvres, all typical tablet interface commands (we tablet kids call them gestures). These give varied results, from 'expected' behaviour such as displaying the option to Paste, to causing parts of the spreadsheet to disappear. This is a high-impact flaw in the application, one that a human tester would find in seconds. (SECONDS I tell you!)

This brings me back to what I mean by manual testing: actually using your hands to test things. Literally: your hands. I found that if I 'pressed and held' for a certain amount of time the spreadsheet would not go blank. But that 'press-but-not-too-quick-and-not-for-too-long' technique was obviously useless for normal usage. This defect is a blocking issue for me; I do not use the Google Docs app on my tablet. I use Polaris, an app that comes free and pre-installed: it has no such issues, and allows me to upload files to the Google Docs server, email the files etc.

The big 'A' Agile crowd and the waterfall/v-model die-hards alike fall into a polarised debate about the need for 'manual [X]OR automated' testing, but really they are not grokking the need for testing, and for testing with the right tools in the right places. Those test tools might be:

  • a logging or monitoring program/script running on those underlying UNIX systems.
  • a fake Google Docs server that lets you check the client app against a server in 'known states'.
  • a fake Google Docs client app...
  • a JavaScript library that exposes the details of the client application, or interacts with it and triggers events as required.
  • a tool that creates random / diverse spreadsheet data - and checks for problems/errors in the server or client etc.
  • a tool that can apply load and measure system performance of those HTTP calls.
(Notice the pattern, test automation is good at doing and checking machine/code oriented things in ways that people are not.)
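As a sketch of the "random / diverse spreadsheet data" item in the list above: generate deliberately awkward cell values and check they survive a round trip through the system. The value list and the `round_trip` stand-in are hypothetical; a real check would push the data through the actual client or server under test.

```python
# Generate awkward spreadsheet data and check it survives a round trip.
import random

AWKWARD_VALUES = [
    "", " ", "0", "-1", "3.14159", "=1+1",         # empty, numeric, formula-like
    "'leading quote", "line\nbreak", "tab\tstop",  # quoting and whitespace
    "日本語", "naïve café", "\u202etseT",           # non-ASCII and bidi text
    "A" * 10000,                                   # very long content
]

def random_sheet(rows, cols, seed=None):
    """Build a rows x cols grid of randomly chosen awkward values."""
    rng = random.Random(seed)
    return [[rng.choice(AWKWARD_VALUES) for _ in range(cols)]
            for _ in range(rows)]

def round_trip(sheet):
    # Stand-in for "save to the server and read it back".
    return [list(row) for row in sheet]

sheet = random_sheet(5, 3, seed=42)
assert round_trip(sheet) == sheet  # flag any value that did not survive
```

This is exactly the kind of machine-oriented grind, thousands of odd values, reliably compared, that automation does well and a person at a touch screen does not.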

Or even:

  • Your hand(s): The haptic feedback doesn't work for XYZ, The tablet can still be hard to use one handed - can we fix that? Why can't I zoom-gesture on this Sunday Times magazine?
  • Your eyes, e.g.: observing instantly the HTTP traffic in a monitoring tool...
  • Your ears: The screen brightness adjusts to ambient light levels, but the speaker does not adjust to ambient noise levels...

All these tools need to be considered in conjunction. For example: now we know the application is prone to these 'disappearing' tricks, how can we (1) stop it happening and (2) detect when it does? Then we can discuss the merits of doing either or both. Sometimes it makes sense to divide our resources and write the test automation as well as the fix, for example when you suspect you haven't caught all the causes of the issue. But that inevitable drag away from more testing means you don't find the next bug, because your team is still coding the fix and the test for the fix (or maintaining both). This is why the decision on how to proceed at that point is always context sensitive.

It's not that one should use manual or automated testing; it's a question of asking: what am I trying to do? What tool do I need? For example, if development has not started, then the best tools might be the tester's brain and a whiteboard rather than a bloated Java framework. If you don't know what you need to test, then your hands and eyes will quickly give you valuable feedback as to where to go next. For example: the application seems sluggish, so we need to check performance and network latency. Or: the spreadsheet disappears! We need to be able to automatically generate those events, and reliably check the visibility.

Thursday, 5 January 2012

Nobody expects the...

In a previous post I discussed one method I use to improve my testing skills, spending spare minutes testing a machine or website that is readily at hand. The example I used was Google's search, in particular its currency conversion feature. This is useful for getting practice, and trying to speed up my testing, that is - finding information more quickly.

Another activity I perform is watching someone else test something. As testers, we are often asked to be a second pair of eyes, as it's assumed that a programmer might not notice some issues in their own code. The idea is that you will not be blinded by the same assumptions, and will hopefully find new issues with the software. Using the same logic, by watching someone else test, I can examine their successes and failures more easily.

I've asked many people to test a variety of objects, usually things to hand, like a wristwatch or something I've recently bought. One recurring pattern I have noticed is in how programmers and testers approach these problems: they tend to use different techniques, and I think this is because testers have a slightly different underlying approach.

For example I once gave a toy to a colleague to 'test'. The description I gave was limited: A small plastic/rubber toy, aimed at toddlers and above. Bought for a few pounds. You can see what it looks like here:

The tests suggested were good, a range of things that I would hope any toy my son had would have been subjected to, for example toxicity and tearing. They also examined the toy and noticed a spike had been torn, after which I explained the toy had previously had a brightly coloured cord loop or lanyard that had been torn off. This also produced a few more relevant tests, all good.

After a while, the suggested tests had been exhausted, as well as the associated questions, such as "Did it come with a manual?" (The answer was no, except for a piece of paper with words to the effect of "Made in China, do not burn".) These were all good tests and questions. I've noticed that testers would ask many of the same questions and suggest similar tests, but would also suggest another group of tests.

Programmers are builders, they focus on what they are constructing. Their experience tends to cause them to follow very much what they are presented with. The plan, the specification, the system they are upgrading. As such, when presented with a testing problem, their tests focus on the same aspects, quite rightly. If they were building the system themselves, their tests and questions would all be what I'd expect them to do. Experienced and skilled programmers bring a wealth of background knowledge that can make their work very thorough, and of high quality.

As such, good programmers can often do a reasonable job of software testing. But there is one area of testing that I have noticed programmers tend to miss. Good testers will often try to find areas they don't know about (rather than examining those that they do know in greater and greater detail). They have techniques for breaking out of their own view of the problem. While a programmer will often only perform a test if they can frame it back to a 'requirement', testers often perform a test simply because they can.

The sorts of things testers suggest are pretty interesting and varied, but they tend to be destructive. I've asked testers, after they have suggested an "off the wall" test that surprised me, "Why would you do that?". The responses vary, and I suspect that the justification is often being generated as I ask. That's not a problem; much of what we do in testing is not "named". These are techniques people have learned by doing, and maybe never had reason to analyse and put a name to. What I think the testers are doing is performing "something" that will expose new behaviour.

They have learned that by doing predictable things, you will tend to get predictable answers. If you work with the same assumptions and behaviours as the rest of your team, then you are unlikely to see new and interesting behaviour. When, for example, asked to test a wristwatch, they suggest removing the battery or throwing the watch in the sea. That may seem a little strange, and it certainly doesn't seem to match the Conditions of Satisfaction. But it might actually help identify important features of the wristwatch that otherwise might not have been discovered: for example, that the watch was made of titanium, which does not corrode in salt water, or that the wristwatch was powered by the motion of the user, as well as by a 'backup' battery.

The testers have learned that getting another viewpoint can uncover new information, and as information gatherers, that's a pretty important achievement. They are climbing a tree not to see if the tree is climbable, but rather to find out what's there. As such, they suddenly see the size of the forest, the life supported in and around the tree, or that they have a pine-wood allergy. In the toy example above, if the tester had picked up the toy and crushed it, they would have noticed it start to flash bright colours.