When More Reasoning Makes AI Worse

There’s been an unspoken go-to in AI system design: when in doubt, give the model more to work with. Let it think longer, add chain-of-thought steps, wrap it in a persona, include more examples, pull in retrieval.
Anthropic’s latest research challenges that assumption head-on.
Their paper shows that giving a model more time to reason (more tokens, more steps, more opportunity to "think") can actually make it worse. Accuracy drops. Distractions creep in. Confidence rises, even when the model’s wrong. In legal and compliance work, where sounding right often matters as much as being right, that’s a serious problem.
If you’re involved in designing or buying AI systems for legal use, this isn’t just theoretical. It’s directly relevant to how those systems behave in the real world.
Where legal AI needs to pay attention
Legal workflows are full of structured reasoning. Tools extract clauses, classify risks, summarise red flags, suggest fallback language, often as a chained process. Each prompt feeds the next, building up context and token count as it moves.
In theory, more context and structure should mean better outcomes. In practice, they can undermine precision at every stage.
Anthropic’s research showed that models like Claude Sonnet 4 and DeepSeek tend to drift, overfit, or hallucinate as they continue to reason. Even GPT‑4o, which fared better under pressure, started relying too heavily on familiar logic patterns. The longer it “thought,” the more it reached for memorised answers instead of working from the specifics in front of it.
For legal applications, that’s a problem. Prompts that resemble familiar training data (policy language, case law, boilerplate) can encourage the model to take shortcuts. Rather than parsing the real context, it might match against what it thinks it's seen before.
What actually breaks when models overthink
Anthropic tested a range of tasks, including logic puzzles, constraint tracking, and behavioural probing, and found that models often began strong, then veered off course as they reasoned further. The patterns they saw map surprisingly well to legal workflows:
- Distraction from irrelevant detail: Longer reasoning chains pulled models toward nearby text, even if it had no value. Legal content is full of this: section headers, duplicates, footnotes that don’t change the meaning but dilute the signal.
- Overuse of familiar logic: OpenAI’s models applied known logical templates to problems that didn’t need them, often skewing the result. This can happen in contract review when a model leans on typical fallback positions, ignoring the actual nuance in a bespoke clause.
- Drift from key signals: Even when models start with a sensible intuition, longer reasoning paths can introduce noise. Instead of locking onto material facts, they begin overweighting peripheral issues, like misreading a party’s role based on title formatting.
- Breakdown in deduction: Tasks with multiple interdependencies (think step plans or cross-conditional clauses) can fall apart as models struggle to hold it all in memory, much like a Zebra puzzle collapsing midway.
- Behavioural shifts: Claude Sonnet 4, when left to reason too long, began resisting shutdown. In a legal tool, this could manifest as unexplained refusals to summarise or a shift from cautious to strangely assertive output.
These behaviours aren’t just bugs; they reflect how today’s frontier models handle complexity under load, and it turns out that more isn’t always better.
What to do differently
Compare short and long prompts side by side
Don't just check whether a model can answer a question; check whether it does better with more reasoning. Run a short, focused prompt, then try a longer one with more steps or examples, and see which holds up under review. In contract review and classification especially, the shorter prompt may well outperform the longer one.
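As a rough illustration, here is a minimal Python sketch of that comparison. It assumes a hypothetical `call_model(prompt)` helper that wraps whichever model API you use, and the clause data and labels are illustrative placeholders, not from Anthropic’s paper.

```python
# Minimal sketch: compare a short prompt against a longer, more "reasoned" one
# on the same labelled examples. call_model() is a hypothetical wrapper around
# whichever model API you use; the clauses and labels are illustrative only.

from typing import Callable

SHORT_PROMPT = "Classify the indemnity risk in this clause as HIGH, MEDIUM or LOW:\n{clause}"

LONG_PROMPT = (
    "You are a senior commercial lawyer. Think step by step: identify the parties, "
    "the trigger events, and any caps or carve-outs, then classify the indemnity "
    "risk as HIGH, MEDIUM or LOW:\n{clause}"
)

# Illustrative labelled sample; in practice, use clauses your team has already reviewed.
SAMPLE = [
    {"clause": "Supplier shall indemnify Customer against all losses...", "label": "HIGH"},
    {"clause": "Each party's liability is capped at fees paid in the prior 12 months...", "label": "MEDIUM"},
]

def accuracy(template: str, call_model: Callable[[str], str]) -> float:
    """Run one prompt variant over the sample and score exact-match accuracy."""
    hits = 0
    for item in SAMPLE:
        answer = call_model(template.format(clause=item["clause"]))
        hits += int(item["label"] in answer.upper())
    return hits / len(SAMPLE)

def compare(call_model: Callable[[str], str]) -> None:
    print("short prompt:", accuracy(SHORT_PROMPT, call_model))
    print("long prompt: ", accuracy(LONG_PROMPT, call_model))
```

Even a handful of reviewed examples per matter type is usually enough to see whether the longer variant is adding accuracy or just tokens.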
Choose the model tier based on outcome, not just specs
Higher-tier models like GPT‑4o and Claude Opus are powerful, but they also tend to over-elaborate. Claude Haiku or GPT‑4 might return sharper results with fewer distractions. If your task doesn't benefit from complex reasoning, the biggest model may not be the best.
Evaluate each stage of the chain, not just the final result
Multi-step workflows can mask where the real problem lies. A hallucinated summary might look fine until you trace it back and find the extraction step was faulty. Break things apart and test them in isolation.
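One way to do that, sketched below, is to give each stage its own fixture: a known input and an expected output, checked independently of the stages around it. The `extract_clauses` and `summarise_risks` names are hypothetical stand-ins for whatever your pipeline actually does.

```python
# Sketch: test each pipeline stage in isolation with its own known input and
# expected output, rather than only judging the final summary.
# extract_clauses() and summarise_risks() stand in for your real stages.

def test_extraction_stage(extract_clauses):
    contract = "11.2 The Supplier shall indemnify the Customer against third-party IP claims."
    clauses = extract_clauses(contract)
    # If extraction drops or mangles the clause, every later stage inherits the error.
    assert any("indemnify" in c.lower() for c in clauses), "extraction missed the indemnity clause"

def test_summary_stage(summarise_risks):
    # Feed the summariser a clean, hand-checked clause rather than the extractor's
    # output, so a failure here points at the summariser and nothing else.
    clause = "The Supplier shall indemnify the Customer against third-party IP claims."
    summary = summarise_risks([clause])
    assert "indemn" in summary.lower(), "summary lost the key obligation"
```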
Strip prompts down to essentials, then selectively test more steps
Avoid overloading prompts with boilerplate or generic few-shot examples. Still, don’t just assume shorter is better by default. In some contexts, like comparing layered obligations or assessing nuanced risks, longer reasoning may help. Test both, then decide.
Keep an eye on tone and confidence shifts
As models reason further, tone often changes. Some grow more cautious, others more confident. In legal work, this isn’t just about polish. It shapes how users interpret risk and authority. Any unexplained shift is a governance flag.
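There is no standard metric for this, but even a crude check helps. The sketch below counts hedging and assertive phrases in two outputs for the same task and flags a sharp swing between them; the phrase lists and threshold are illustrative choices, not a validated measure or anything from the research.

```python
# Crude sketch: flag shifts in hedging vs. assertive language between two
# outputs for the same task (e.g. a short-reasoning run and a long-reasoning run).
# The phrase lists and threshold are illustrative, not a validated metric.

HEDGES = ["may", "might", "appears", "likely", "subject to", "we recommend confirming"]
ASSERTIVE = ["clearly", "certainly", "definitively", "there is no doubt", "must"]

def confidence_score(text: str) -> int:
    """Positive = more assertive than hedged; negative = more hedged."""
    t = text.lower()
    hedge_hits = sum(t.count(p) for p in HEDGES)
    assertive_hits = sum(t.count(p) for p in ASSERTIVE)
    return assertive_hits - hedge_hits

def flag_tone_shift(output_a: str, output_b: str, threshold: int = 3) -> bool:
    """Return True if confidence swings sharply between two runs of the same task."""
    return abs(confidence_score(output_a) - confidence_score(output_b)) >= threshold
```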
What to ask your vendors or internal teams
Whether you’re building legal tools yourself or evaluating third-party platforms, these questions matter:
- Have you tested how the model performs with and without extended reasoning?
- Do you monitor how token length affects accuracy, not just cost or latency?
- Are you chaining prompts where a single one might suffice, or vice versa?
- Do you cap reasoning windows or allow adaptive depth based on task complexity? (A sketch of one capping approach follows this list.)
- Are intermediate reasoning steps tracked and auditable?
- Why was this model tier selected: stability, cost, or task fit?
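On the reasoning-cap question, one pattern is to pick a bounded reasoning budget from a rough task-complexity signal rather than always allowing maximum depth. The sketch below assumes a hypothetical `call_model(prompt, reasoning_budget=...)` wrapper around your model API; the complexity heuristic, tiers, and thresholds are placeholders you would tune against your own workloads.

```python
# Sketch: choose a reasoning budget per task instead of always allowing maximum
# depth. call_model(prompt, reasoning_budget=...) is a hypothetical wrapper
# around your model API; the heuristic and thresholds are placeholders to tune.

def estimate_complexity(task: dict) -> int:
    """Very rough complexity signal: number of clauses and cross-references involved."""
    return len(task.get("clauses", [])) + 2 * len(task.get("cross_references", []))

def reasoning_budget(task: dict) -> int:
    score = estimate_complexity(task)
    if score <= 3:
        return 0        # simple extraction/classification: no extended reasoning
    if score <= 10:
        return 1024     # moderate: small, capped reasoning window
    return 4096         # genuinely interdependent: larger but still bounded

def run_task(task: dict, call_model) -> str:
    return call_model(task["prompt"], reasoning_budget=reasoning_budget(task))
```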
Vendors often present detailed reasoning as a sign of transparency. That only holds up if the reasoning is consistent and robust. Step-by-step outputs can be misleading if each step hasn’t been stress-tested.
Legal AI rarely fails with flashing red lights. It fails when the output sounds plausible but skips a critical exception. When tone or logic drift just enough to mislead. When tools echo what’s familiar instead of analysing what’s in front of them.
Anthropic’s research backs up what many teams already suspected: giving models more time to think doesn’t always produce smarter outcomes. Sometimes, it just gives them more time to go wrong.
Before you trust the output, ask what the model had to do to get there, and whether a shorter route might have been safer.