
Vimes Boots & why the right AI evals could save your project

"The reason the rich were so rich, Vimes reasoned, was because they managed to spend less money" - Sam Vimes, from Men at Arms by Terry Pratchett.

The theory goes that richer people can afford a pair of boots that lasts far longer before repair or replacement, while poorer people can only afford cheaper, less hardy boots that need replacing in a third of the time.

It’s the old "buy cheap, buy twice" idea, and it still applies today, especially in the world of AI agents and LLMs. Currently your agent is 'managed' or 'directed', at least in part, by a harness. In fact, an IDE coding agent is really a harness plus an LLM. The LLM “writes the code” and requests tool use; the harness executes those tool calls, e.g. writing a file or running a command-line tool.

In one sense the LLM does the productive side of the agent’s work, while the harness plays a more deterministic, feedback role (e.g. reporting back that the command the LLM suggested returned result xyz, or that the code errored when it ran).
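That split between LLM and harness can be sketched as a toy loop. This is a minimal illustration, not any vendor’s real API: `fake_llm` and `run_tool` are stand-ins for the model call and the harness’s tool execution.

```python
# Toy sketch of the harness/LLM split: the LLM proposes, the harness
# executes tools deterministically and feeds the results back.
# `fake_llm` and `run_tool` are illustrative stubs, not a real API.

def fake_llm(history):
    # A real LLM would decide what to do; this stub requests one tool
    # call, then answers once it has seen a tool result.
    if not any(m["role"] == "tool" for m in history):
        return {"type": "tool_call", "name": "run_tests", "args": {}}
    return {"type": "answer", "content": "tests passed, task done"}

def run_tool(name, args):
    # The harness's deterministic side: run the command, report back.
    return f"{name} -> exit code 0"

def agent_loop(task, max_turns=10):
    history = [{"role": "user", "content": task}]
    for _ in range(max_turns):
        reply = fake_llm(history)
        if reply["type"] == "tool_call":
            result = run_tool(reply["name"], reply["args"])
            history.append({"role": "tool", "content": result})  # feedback
        else:
            return reply["content"]
    return "gave up"
```

Note that every trip around that loop costs tokens; that matters later.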

That dynamic works well, and you can see it in Anthropic's toolset, where the LLM (i.e. Opus 4.7) and the harness (i.e. Claude Code) have become hugely effective. Claude Code and other agent-capable tools fit together well. But these tools can create a slightly perverse incentive, depending on how you pay for your agents. If you pay a flat fee this might be less of an issue, but if you pay per token, that pricing may influence the vendor’s motivations.

E.g.: What if your LLM isn't very good at what you are using it for? We've all seen the benchmarks stating how well the LLMs do at maths reasoning and the like. But are they good at what you do? If I think about banking, financial services or payments: does the LLM create code that works well for your agents operating in your business environment?

Let’s say we're using my imaginary new LLM, “VimesBootsLLM Pro”: it’s fast, it's 30% cheaper than the competition, and it has “reasoning”, tool use and so on. It passes all the usual benchmarks and sits up there with the best of them in the league tables.

But what if you have your agents doing specific tasks (in, say, banking or insurance) as part of their workflow? You might have agents writing code relevant to that domain. Do they perform well? And what does well/good even mean here?

Take a trivial example from bank payments: IBAN (International Bank Account Number) validation. Can the model create that code repeatedly? How often, or what percentage of the time, is its code correct? (And yes, what do we mean by correct…) If our agent harness is using this model and finds code errors that terminate execution 50% of the time, the harness is going to send that code and result back to the LLM, which will rewrite the code, maybe doubling the number of tokens used.

What if your tests/evals find the code does not error, but also doesn't give the correct response 25% of the time? There are edge cases in even simple algorithms like IBAN validation. Then the harness will send this back to the server and the LLM will try again. All this back and forth is costing you money and cluttering your context window (the agent’s short-term memory).
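For context, here is roughly what we are asking the model to produce: a minimal sketch of the standard ISO 13616 mod-97 IBAN check. This version deliberately skips the per-country length rules, which is exactly the kind of edge case an eval should catch.

```python
def is_valid_iban(iban: str) -> bool:
    """Minimal ISO 13616 mod-97 check (ignores per-country length rules)."""
    iban = iban.replace(" ", "").upper()
    if not (15 <= len(iban) <= 34) or not iban.isalnum():
        return False
    # Move the first four characters (country code + check digits) to the end.
    rearranged = iban[4:] + iban[:4]
    # Replace each letter with its numeric value: A=10, B=11, ..., Z=35.
    digits = "".join(str(int(ch, 36)) for ch in rearranged)
    # A valid IBAN leaves a remainder of 1 modulo 97.
    return int(digits) % 97 == 1
```

The widely published example IBAN "GB82WEST12345698765432" passes this check; flip one digit and it fails. Simple, but there is plenty of room for a model to get the rearrangement or the letter-to-number mapping subtly wrong.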

As you can see, the headline cost and benchmark stats are like mileage or battery-life stats: YMMV, Your Mileage May Vary. What’s even more misleading is that “reasoning” models can spend a lot of effort internally trying to get to the right answer. That internal reasoning uses tokens, and while the price per token may be low, if your model is doing twice as much reasoning the cost will be proportionally higher than for an LLM that does less “reasoning”.
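The arithmetic is worth spelling out. With made-up numbers (the 30% discount from our imaginary VimesBootsLLM Pro, and an assumed 2x reasoning-token overhead), the per-token bargain loses:

```python
# Hypothetical figures: the "cheap" model is 30% less per token but,
# in this illustration, burns twice the tokens per task on reasoning
# and retries. Prices are in relative units per 1,000 tokens.
cheap_price_per_1k, cheap_tokens = 0.70, 4000
premium_price_per_1k, premium_tokens = 1.00, 2000

cheap_cost = cheap_price_per_1k * cheap_tokens / 1000      # ~2.8
premium_cost = premium_price_per_1k * premium_tokens / 1000  # ~2.0
# The "cheaper" model works out ~40% more expensive per completed task.
```

The exact numbers are invented; the point is that cost per task, not cost per token, is the figure that matters.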

Thinking this through, you can see that it could be in a vendor’s interest not to be quite as good at one-shot coding (getting it right first time, unaided by similar examples), if they are selling usage by the token. They might create an awesome harness that can handle various values of fail, handle them gracefully, and get the LLM to fix things next time around, especially if it’s failing reasonably quietly and the user is not inconvenienced.

It’s a bit like how it’s alleged that some search engines reduced the quality of their search results in a misguided attempt to sell more adverts: a short-term spike in revenue, a long-term collapse of user trust. Or how dating apps are not incentivised to make especially good matches, because if they were almost perfect you would be less likely to keep using them (as you would be dating your one true love, whom you met on the first match!).

I’m not suggesting this is Anthropic’s plan; in fact I think they take measures to ensure this isn’t the case for Claude Code users (their recent doubling of allowances is consistent with this viewpoint). But if we look at the industry as a whole, and factor in not just IDE-based coding agents but also other AI agents running headless in your back-end systems, are the model makers’ incentives fully aligned with yours? Measuring the cost per successful delivery is a better bet.

Like Vimes and his cheap-boots problem, sometimes we need to look at the total cost of ownership over time and at how we intend to use the LLM. And it makes sense to look at this without custom skills and harnesses, to reduce noise. I don’t mean that such agent-and-skills checks should not be done; they should, it’s just that they are testing something else, something bigger, and are more akin to end-to-end tests.

DeepSeek V4 Flash does almost as well as Qwen 3.5 at a much lower cost.


Take this set of trivial "eval" results for some payment-related functions. Essentially the LLM is provided with a typical bank-payment validation functional description and asked to create code to implement it. The code is then unit tested against several examples and the results recorded. Rinse and repeat, many times, across many function descriptions and LLMs. Factor in cost and accuracy, and we get a pretty decent idea of how well the LLMs handle the bank payments domain.
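The eval loop just described can be sketched in a few lines. Everything here is a stand-in: `generate_and_test` fakes the LLM call and unit-test run with an assumed pass rate and a made-up per-attempt cost, but the aggregation, pass rate and cost per success, is the real shape of the analysis.

```python
# Toy eval loop: generate code for a prompt many times, unit test each
# attempt, then report pass rate and cost per successful result.
# `generate_and_test` is a stub; the figures are invented.
import random

def generate_and_test(prompt, seed):
    random.seed(seed)
    return {
        "passes_tests": random.random() < 0.75,  # assumed ~75% pass rate
        "cost": 0.002,                           # assumed cost per attempt
    }

def run_eval(prompt, trials=100):
    results = [generate_and_test(prompt, seed=i) for i in range(trials)]
    passes = sum(r["passes_tests"] for r in results)
    total_cost = sum(r["cost"] for r in results)
    return {
        "pass_rate": passes / trials,
        # You pay for the failures too, so divide total spend by successes.
        "cost_per_success": total_cost / passes if passes else float("inf"),
    }
```

The key line is the last one: failed attempts still cost money, so a cheap-per-token model with a low pass rate can have a worse cost per success than a pricier, more accurate one.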

This sort of analysis is going to matter more in the coming months, as more teams adopt agentic approaches and keep wondering why their inference costs are so high despite choosing what they thought was a good model at a low price! Smart companies will look for this sort of analysis to ensure they don't buy cheap and buy twice.

This sort of analysis of software, AI tools and teams is what I do; if you want to know more, get in touch.

