
AI Agents & the aptly named Butt report

It's August 1941, in the opening years of WW2. The Axis powers had conquered much of mainland Europe, and the Allies were desperately trying to limit the capacity of the Axis military-industrial complex. They did this with little feedback on their success - much like some modern teams using AI tools.

The Royal Air Force's (RAF) strategic bombing campaign had applied considerable resources to the task of diminishing continental targets such as occupied French ports or the factories of the Ruhr valley. Much as the northwest and northeast of England were the industrial centres of Britain, the Ruhr was where a large part of the Axis industrial might lay, and so it was an important military target.

The air crews reported back after their sorties (missions), stating how successful those missions had been. If such reports were to be believed, Germany and the Axis powers should have been decimated, ‘on their knees’, and would fall or at least falter in short order. Wisely, the Air Ministry asked for evidence and began fitting cameras to bombers and reconnaissance aircraft.

An analysis of several hundred sorties, comparing mission targets to photographic evidence, became known as the Butt report (after its author), and it was brutal in its conclusions.

The headline figure: only 1 in 3 aircraft recorded as attacking their target got within five miles (approx. 8 km). Note: that's among crews who recorded that they had attacked the target, and it excludes sorties where crews were unable to reach or locate their targets.

But the high-level figure of 1 in 3 hides other details. The phase of the moon (and hence ambient light levels) had a dramatic effect. Under a full moon, attacks were successful (within 5 miles/8km) 2 in 5 times, but on a darker 'new moon' night this dropped to just 1 in 15 (about 7%).

A closer look showed that while occupied French ports could typically be "hit" 2 in 3 times (again, I question whether being 4.9 miles away counts as a 'hit'), the industrial heartland of the Ruhr was typically hit just 10% of the time. Given that success was measured as "within a radius of 5 miles" - an area of roughly 78 square miles - the notion that the strategic bombing campaign was anything more than an annoyance to the Axis industrial base was fanciful.

It's admirable that the Air Ministry became sceptical and decided to test the claims with photography and statistical analysis. But how many aircrews and civilians on the ground had been killed or injured in these unchecked and errant raids? Let alone the waste of materiel that could have been used elsewhere in the war to greater effect.

What this highlights, in agile-speak, is the theory of constraints. New aircraft and technology had enabled larger-scale aerial bombardment across the continent, but that had (eventually) exposed the pitiful state of the navigation and verification parts of the "strategic" bombing process.

A snippet of the Butt report's conclusions.

This same process is playing out in the AI agents domain right now. Enabled by new technology (Large Language Models trained for programming, and agent harnesses like Claude Code), teams can write software more quickly, but the means of assessing their success still lags behind. As Zack Akil (Google) puts it: what you want is Immediately Verifiable Output.

If your agents' output can be rapidly assessed, then you can both cover more ground and reach a higher degree of quality (and reliably know you have). And there is no shortage of such problems: it's not uncommon to find problems that are hard "to do" but easy to verify (think Sudoku, jigsaw puzzles, Rubik's cubes or, for the mathematically inclined, factoring large integers into their constituent primes - and several more business-focused areas).
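The factoring example makes the asymmetry concrete. Finding the prime factors of a large number is expensive, but checking a proposed answer is a single multiplication. A minimal sketch (the function name is mine, not from any library):

```python
# Illustrative sketch: verifying a solution can be far cheaper than finding one.
# Factoring a large integer is hard; checking a proposed factorisation is just
# one multiplication - the essence of Immediately Verifiable Output.

def verify_factorisation(n: int, factors: list[int]) -> bool:
    """Does the product of the proposed factors equal n?"""
    product = 1
    for f in factors:
        product *= f
    return product == n

# Finding the factors of 2021 takes trial division; checking them is instant.
print(verify_factorisation(2021, [43, 47]))  # 43 * 47 = 2021 -> True
print(verify_factorisation(2021, [3, 673]))  # 3 * 673 = 2019 -> False
```

The same shape applies to Sudoku or a jigsaw: the "is this correct?" check is quick and mechanical, even when producing the answer is not.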

This is a boon for some companies. If your business is automated validation, this is your time to shine: your validator can provide just the sort of Immediately Verifiable Output that agents need to home in on a solution. (See my previous post for an example of how this can be done.)
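The loop an agent runs against such a validator can be sketched in a few lines. Everything here is hypothetical scaffolding - `propose` stands in for the AI agent and `validate` for the automated validator - but it shows why a fast validator matters: each rejection becomes feedback for the next attempt.

```python
# Minimal sketch of a generate-and-verify loop, assuming a hypothetical
# agent (`propose`) and a fast automated validator (`validate`).
from typing import Callable, Optional

def solve_with_feedback(propose: Callable[[str], str],
                        validate: Callable[[str], bool],
                        max_attempts: int = 5) -> Optional[str]:
    feedback = "initial attempt"
    for _ in range(max_attempts):
        candidate = propose(feedback)
        if validate(candidate):
            return candidate                      # immediately verified: accept
        feedback = f"rejected: {candidate}"       # feed the failure back in
    return None                                   # budget exhausted

# Toy demonstration: the "agent" just counts attempts; the validator
# accepts only the fourth proposal.
attempts = iter(range(10))
result = solve_with_feedback(lambda fb: f"attempt-{next(attempts)}",
                             lambda c: c == "attempt-3")
print(result)  # attempt-3
```

The slower or vaguer the validator, the fewer useful iterations the agent gets - which is exactly the bottleneck the rest of this post is about.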

Where things become messy is pretty much everywhere else. How do you know that the AI agent-built system managing your company's finances is correct? The agent created some tests - but are they good? Are they checking the right things? As Tim Harford recently put it in the Financial Times:

"The most problematic cases are the ones where it is hard to know whether the AI has done a good job, and expensive if it turns out it has not"

Another agent focused on testing could help (I imagine some of you are replying) - and you're right, it could. But then we are looking at a common-mode scenario, where both building and testing are done by the same tool, using the same data and preconceptions. That's not to say models/agents checking deliverables is worthless - it isn't; it's just not the be-all and end-all, especially given the capabilities of current models. Future models will no doubt improve.

Another approach is to provide such a high degree of fidelity and detail in the AI agent's instructions that it can't help but deliver the correct result. There is a long history of the design-specification approach to software development, ranging from formally defined software using tools like Z notation through to more 'agentic' approaches like specification-driven development (SDD). They suffer from an inability to handle the messy realities of some software. As one of my computer science professors once said: you try formally specifying the sort of GUI/app the customer has just drawn on the whiteboard. Assuming the customer even knows what they want. (They usually think they do.) But once again, better specifications, especially those that can be iterated on quickly after the agent completes each version of the code, will be an important part of the solution - essentially a new and higher level of abstraction.
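One lightweight way to make a specification iterable is to express its acceptance criteria as machine-checkable rules that can be re-run after every agent iteration. A hypothetical sketch (the invoice example and all names here are invented for illustration, not drawn from any SDD tool):

```python
# Hypothetical sketch: a fragment of a specification as named, checkable
# rules. After each coding iteration, re-run the agent's implementation
# against the spec and feed any failures back.

def spec_invoice_total(implementation) -> dict[str, bool]:
    """The 'spec': an invoice total must sum its line items,
    and an empty invoice must total zero."""
    return {
        "sums line items": implementation([10.0, 2.5]) == 12.5,
        "empty invoice is zero": implementation([]) == 0.0,
    }

# A candidate implementation the "agent" produced:
candidate = lambda items: sum(items)
print(spec_invoice_total(candidate))  # both rules pass
```

This doesn't solve the whiteboard-GUI problem - you still can't easily write such rules for "it should feel right" - but for the parts of a system that can be specified this way, it turns the spec itself into Immediately Verifiable Output.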

But of course, this is specs, models and turtles all the way down. At some point we need a person to decide: does the software I paid for do what's required of it? Does it make my business more money? Will it upset the financial regulator, or land me on the receiving end of a lawsuit?

Some of those questions can be answered by "shipping" the code - that is, just giving the app to customers, who will either love it and subscribe, or not. But many businesses can't do that. And those that can't tend to be big and profitable, and make up the bulk of the market outside the bubble of big tech (think energy, finance, banking, law firms, aerospace, etc.).

Those sorts of enterprises need evidence, and a person to testify (sometimes literally) that the app does what it "says on the tin" - and to let them know if the app does other things they may not like… and hence implicitly guide their agents towards improvement. The investigators might need to use AI agents, engineers, testers or other products and automated tools to provide that feedback. The job title of the people doing that work might not be one in use today - but that testing process will need to be there.

So, what will happen in practice? Who knows for sure! Probably all of the above (including some lawsuits), but I think the role of engineers will move towards detecting problems and specifying better solutions, rather than coding in the traditional sense. In old money: more solution architecting and test architecting, probably still split between two roles - leaving the test architect free to focus on checking and investigation, while the software solutions architect explores novel means of system design.
