Choosing Models When the Answer Is Already Correct
I’ve been spending more time than I expected looking at LLM bills.
Not pricing tables or eval posts, just what happens once something’s live and cracking on in the background. When an AI feature stops being experimental and starts existing as infrastructure.
That’s where bad assumptions get expensive, quick.
Most teams begin sensibly. Pick a strong, well-known model, wire it in, ship the feature, move on. Early usage is light, with a few known users, the bill is small, and nobody questions the choice. All very responsible.
The problem is that this decision hardens far earlier than it should. The model becomes "the AI" rather than a component, and by the time anyone looks up and asks whether it’s still the right choice, it’s already everywhere.
A Deliberately Boring Test
To challenge that assumption, I ran a deliberately dull benchmark.
One clause, one simple question, no clever prompt engineering or edge cases: just the kind of task that shows up constantly in real reviews.
Content
Termination
12.1 Either party may terminate this Agreement immediately on written notice if the other party commits a material breach which is not remedied within thirty (30) days of notice.
12.2 The Customer may terminate this Agreement for convenience on ninety (90) days’ written notice.
12.3 Upon termination for any reason, all outstanding fees shall become immediately payable.
Question
Extract any termination for convenience rights. Quote the relevant clause and identify which party holds the right.
I ran this exact prompt, unchanged, across four different models.
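If you want to reproduce this kind of run, the harness is genuinely small. Here is a minimal sketch, assuming an OpenAI-compatible gateway; the base URL, API key and model identifiers are placeholders, not the exact setup behind the numbers below.

```python
# Minimal sketch: run one fixed prompt across several models and record
# latency plus token usage. Assumes an OpenAI-compatible gateway; the
# base_url, api_key and model names are placeholders.
import time
from openai import OpenAI

client = OpenAI(base_url="https://your-gateway.example/v1", api_key="YOUR_KEY")

CLAUSE = "..."  # the full Termination clause quoted above
QUESTION = (
    "Extract any termination for convenience rights. Quote the relevant "
    "clause and identify which party holds the right."
)
MODELS = ["model-a", "model-b", "model-c", "model-d"]  # placeholder identifiers

for model in MODELS:
    start = time.perf_counter()
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": f"{CLAUSE}\n\n{QUESTION}"}],
    )
    elapsed = time.perf_counter() - start
    usage = response.usage
    print(f"{model}: {elapsed:.2f}s | prompt={usage.prompt_tokens} "
          f"completion={usage.completion_tokens}")
```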
All four got it right, obviously.
Same clause, same quote, same conclusion. From a product or user perspective, the answers were interchangeable. At that point, quality stopped being the interesting variable.
When Quality Is Flat, Everything Else Suddenly Matters
Once correctness is a given, the decision space collapses fast because you’re left with three things that actually affect the system:
- How long the model takes to respond
- How consistent that response time is
- How much you pay per call
That’s it.
| Rank | Model | Provider | Total cost per call ($) | Prompt tokens | Completion tokens | Reasoning tokens |
|---|---|---|---|---|---|---|
| 1 | Gemini 3 Pro (preview) | Google | 0.00246 | 244 | 164 | 103 |
| 2 | GPT-5 (Aug 2025) | OpenAI | 0.00484 | 243 | 454 | 384 |
| 3 | Claude 4.5 Sonnet | Anthropic | 0.00534 | 305 | 295 | 201 |
| 4 | Grok-4 | xAI | 0.00716 | 917 | 397 | 284 |
In this run, the cost per call ranged from roughly $0.0025 at the low end to just over $0.007 at the high end. Nearly a threefold difference for the same answer.
Latency followed the same pattern. The slower models took noticeably longer end to end, with more variance. The faster ones were quicker and steadier.
Nothing about the output justified that spread.
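For reference, the cost column is just arithmetic over the token counts: tokens in and out, multiplied by the provider’s per-token prices (reasoning tokens are typically billed as output, though that varies by provider). A sketch of the calculation, with placeholder prices rather than real rates:

```python
# Sketch: cost per call from token usage and per-million-token prices.
# The prices below are illustrative placeholders, not real rates; check
# your provider's current pricing.
PRICES_PER_MILLION_TOKENS = {
    "model-a": {"input": 1.25, "output": 10.00},
    "model-b": {"input": 2.00, "output": 8.00},
}

def cost_per_call(model: str, prompt_tokens: int, output_tokens: int) -> float:
    prices = PRICES_PER_MILLION_TOKENS[model]
    return (prompt_tokens * prices["input"]
            + output_tokens * prices["output"]) / 1_000_000

# e.g. 250 prompt tokens and 300 output tokens at the placeholder rates above
print(f"${cost_per_call('model-a', 250, 300):.5f}")
```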
"It’s Only a Few Pennies" Is How Costs Sneak In
On its own, a difference of a few thousandths of a dollar looks trivial. That’s how these decisions survive review, but this kind of query doesn’t run once.
It runs in bulk, inside background jobs, and it runs again when something retries. That’s where cost creeps in, without ever tripping a budget alarm.
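To make "a few thousandths of a dollar" concrete, here is the same per-call spread projected across a hypothetical workload. The call volume and retry rate are made-up numbers; the point is the shape, not the figures.

```python
# Sketch: project monthly spend from per-call cost. The volume and retry
# rate are hypothetical assumptions; the per-call costs are the low and
# high ends of the benchmark above.
CALLS_PER_DAY = 50_000   # assumption
RETRY_RATE = 0.10        # assumption: 10% of calls retried once
DAYS = 30

def monthly_cost(cost_per_call: float) -> float:
    effective_calls = CALLS_PER_DAY * (1 + RETRY_RATE) * DAYS
    return effective_calls * cost_per_call

print(f"cheapest model: ${monthly_cost(0.00246):,.0f}/month")  # ≈ $4,059
print(f"priciest model: ${monthly_cost(0.00716):,.0f}/month")  # ≈ $11,814
```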
Latency compounds too. Slower calls mean more retries. Retries mean more concurrency. More concurrency means throttling, fallbacks, and defensive logic, and those fallbacks often hit the same expensive model again.
Speed isn’t just about user experience, it’s what shapes system behaviour.
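If you do add fallbacks, it helps to make the chain explicit, so a failed call goes to a different, right-sized model rather than hammering the same expensive one again. A minimal sketch, again assuming an OpenAI-compatible gateway, with placeholder model names and timeout:

```python
# Sketch: a fallback chain that degrades to a different model on timeout
# instead of retrying the same expensive one. Gateway details, model
# names and the timeout value are placeholders.
from openai import OpenAI, APITimeoutError

client = OpenAI(base_url="https://your-gateway.example/v1", api_key="YOUR_KEY")
FALLBACK_CHAIN = ["cheap-fast-model", "mid-tier-model"]  # right-sized primary first

def answer_with_fallback(prompt: str, timeout_s: float = 10.0) -> str:
    last_error: Exception | None = None
    for model in FALLBACK_CHAIN:
        try:
            response = client.chat.completions.create(
                model=model,
                messages=[{"role": "user", "content": prompt}],
                timeout=timeout_s,
            )
            return response.choices[0].message.content
        except APITimeoutError as exc:
            last_error = exc
    raise last_error  # only reached if every model in the chain timed out
```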
The Frontier Model Problem, in Plain Terms
In this benchmark, the most capable model delivered no additional value. It was slower, cost more, and in the end produced the same correct answer as the cheaper alternatives.
That doesn’t make it a bad model, it makes it the wrong default.
Once quality is flat, extra capability has negative marginal value. You are paying for reasoning depth and generality that the task does not require.
For high-volume, low-ambiguity work, that’s not sophistication, it’s waste... unless you're looking for one of those little OpenAI awards for spending tonnes of cash.
Why This Keeps Catching Teams Out
I see this pattern almost every time someone actually measures it.
Frontier models are strong across the board, but they’re also rarely the fastest and never the cheapest. Mid-tier and smaller models often sit right on the efficient frontier for narrow tasks: fast enough, cheap enough, and accurate enough.
Most teams never discover this because they never run their real prompts across alternatives. Public benchmarks won’t tell you and vendor comparisons won’t tell you.
Only your own workload does, and it’s all that ever will.
Speed Is Also a Control Mechanism
There’s a tendency, especially in regulated environments, to assume that the most powerful model is the safest choice.
In practice, the opposite can be true.
Slower systems are harder to reason about. They introduce more retries, more edge cases, more fallback paths. Behaviour becomes less predictable, not more.
Faster, simpler components are easier to constrain, monitor, and audit. Choosing the right-sized model for each task often improves governance as well as cost.
The Rule That Falls Out of the Data
Once you look at the numbers, a very simple rule emerges: if two models produce the same correct answer, the slower and more expensive one should not be in the critical path.
That’s not a controversial position (I hope), it’s basic engineering.
In this single benchmark, removing the most expensive option would have reduced cost immediately, with no loss of quality and better responsiveness. Apply that logic consistently and the savings compound quickly.
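In code, the rule is barely more than a lookup: route each known task to the cheapest model that passed its benchmark, and reserve the frontier model for the tasks that genuinely need it. A sketch with hypothetical task names and model identifiers; the mapping should come from your own results, not from this example.

```python
# Sketch: per-task model routing. Task names and model identifiers are
# hypothetical; populate the mapping from your own benchmark results.
TASK_MODEL_MAP = {
    "clause-extraction": "cheap-fast-model",   # same answer as the frontier model, ~3x cheaper
    "open-ended-drafting": "frontier-model",   # genuinely benefits from extra capability
}

def pick_model(task: str, default: str = "mid-tier-model") -> str:
    return TASK_MODEL_MAP.get(task, default)

print(pick_model("clause-extraction"))  # cheap-fast-model
```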
What This Actually Requires
This isn’t about building a huge benchmarking platform, though you can if you want.
Take the prompts you already run in production. Don’t polish them. Don’t optimise them. Run them across a wide range of models and record cost and latency alongside correctness.
Decide what "good enough" really means for each task. Once you do that, many decisions stop being subjective. Most AI systems don’t struggle because models are too weak, they struggle because early assumptions are never revisited after making them.
Strong general-purpose models are impressive and genuinely useful, they just shouldn’t sit at the centre of every workflow by default.
Once you benchmark real tasks and control for correctness, cost and speed become the dominant factors, and cheaper models stop feeling risky and start feeling obvious.
This wasn’t a clever experiment, it was a boring measurement exercise that exposed an expensive assumption.
Most teams will see the same thing when they look. They just need to actually look.