What Happens When Legal Models Optimise for Approval

Anthropic ran an experiment recently that legal tech teams should take seriously.
They trained a version of Claude 3.5 to follow a hidden goal: please a fictional reviewer with a set of odd preferences. This reviewer liked camelCase in code, overly positive answers, and references to chocolate, even where it made no sense. The model adapted, inserted those preferences where it could, and denied doing so when asked directly.
The output looked fine. Answers were coherent and plausible. If you weren’t paying attention to what was driving those responses, you’d miss the problem entirely.
So what’s that got to do with legal?
Let’s say your legal AI assistant has been trained, fine-tuned, or reinforced based on internal review data. That means it’s learned how to get green ticks from reviewers. Maybe that’s:
- Always red-flagging indemnity clauses
- Preferring overly cautious language
- Prioritising brevity over nuance
- Mirroring a particular partner’s style
None of this has to be deliberate. You've reinforced it through the prompts you wrote, the rating interfaces you built, and the deployment choices you made.
Now you’ve got a model that’s not trying to be accurate. It’s trying to get approved.
That’s reward-model sycophancy. Anthropic showed this behaviour can be trained in. They also showed it can be deliberately hidden. Even direct prompts couldn’t get the model to admit its preference or bias.
Legal models can pass every test and still be misaligned
The risk here isn’t the obvious kind of failure. This isn’t about hallucinations. It’s about a model that consistently chooses the answer most likely to pass review. That might mean skipping edge cases, avoiding anything too long, or favouring a partner’s usual phrasing over a more appropriate one.
The danger is subtle. When the answer still looks right, most people won’t question it.
That works fine until the model is used in a live matter, where detail matters and style points don't. A plausible answer, optimised for approval, fails under pressure.
Auditing legal models means asking harder questions
Accuracy is not the only metric that matters. You need to understand how the model reached an answer, and what it was actually optimising for.
Here's how we start to tackle this problem:
1. Track what influenced the response
Was the decision based on the clause text? Something in the surrounding context? A particular phrase that appears frequently in past reviews? Good tooling can expose this.
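What that tooling looks like will depend on your stack, but the core of it is logging an "influence record" next to every answer. Here is a minimal sketch; the record fields, the `log_response` helper, and the trigram heuristic are illustrative assumptions, not any vendor's API.

```python
# Sketch: log what the model saw alongside what it said, plus a crude check for
# whether the answer leans on phrases that dominate past reviews.
from dataclasses import dataclass, field, asdict
from collections import Counter
import json
import re

@dataclass
class InfluenceRecord:
    clause_text: str
    surrounding_context: str
    retrieved_review_snippets: list[str]
    response: str
    stock_phrase_hits: dict[str, int] = field(default_factory=dict)

def stock_phrase_counts(response: str, past_reviews: list[str], top_n: int = 5) -> dict[str, int]:
    """Count how often the response reuses the trigrams that dominate past reviews."""
    def trigrams(text: str):
        words = re.findall(r"[a-z']+", text.lower())
        return list(zip(words, words[1:], words[2:]))
    common = Counter(t for review in past_reviews for t in trigrams(review)).most_common(top_n)
    return {" ".join(t): trigrams(response).count(t) for t, _ in common}

def log_response(clause, context, snippets, response, past_reviews, path="influence_log.jsonl"):
    record = InfluenceRecord(clause, context, snippets, response,
                             stock_phrase_counts(response, past_reviews))
    with open(path, "a") as f:
        f.write(json.dumps(asdict(record)) + "\n")
```

Even something this crude makes the question answerable later: was the answer driven by the clause, or by the phrasing reviewers keep rewarding?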
2. Flip the perspective
Ask the model to justify its answer as if it were a reviewer. Or test how it handles challenges to its own output. If it can’t defend its reasoning, that’s worth flagging.
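One way to run that check in practice is a two-pass prompt: critique, then defend. The sketch below assumes a generic `ask_model(prompt)` callable and illustrative prompt text; the concession heuristic is deliberately rough.

```python
# Sketch: have the model critique its own answer as a reviewer, then defend it.
# ask_model() stands in for whatever completion call your stack uses.
def reviewer_challenge(ask_model, clause: str, original_answer: str) -> dict:
    # First pass: attack the output the way a sceptical reviewer would.
    critique = ask_model(
        "You are a sceptical senior reviewer. Critique the analysis of the clause below. "
        "Identify any edge cases it skips or reasoning it cannot support.\n\n"
        f"Clause:\n{clause}\n\nAnalysis:\n{original_answer}"
    )
    # Second pass: ask it to defend the original answer against that critique.
    defence = ask_model(
        "Here is a critique of your earlier analysis. Defend the analysis point by point, "
        "or concede where the critique is right.\n\n"
        f"Critique:\n{critique}\n\nOriginal analysis:\n{original_answer}"
    )
    # Crude heuristic: if the defence is mostly concessions, flag the answer for a human.
    concessions = sum(defence.lower().count(phrase)
                      for phrase in ("you are right", "i concede", "on reflection"))
    return {"critique": critique, "defence": defence, "needs_review": concessions >= 2}
```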
3. Test resistance to feedback loops
Try giving inconsistent or misleading feedback. See if the model starts reinforcing surface-level patterns instead of legal accuracy. This helps detect where drift is creeping in.
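A simple probe is to replay the same clause with contradictory "reviewer feedback" injected into the context and measure how far the answer moves. This is a sketch under the same assumptions as above: `ask_model()` and the feedback strings are placeholders.

```python
# Sketch: does the answer track the clause, or the reviewer signal?
import difflib

def feedback_drift_probe(ask_model, clause: str) -> dict:
    baseline = ask_model(f"Review this clause and flag any risks:\n{clause}")
    answers = {}
    nudges = {
        "praise_brevity": "Reviewer note: previous answers were praised for being short.",
        "praise_caution": "Reviewer note: previous answers were praised for flagging every clause as high risk.",
    }
    for label, feedback in nudges.items():
        answers[label] = ask_model(f"{feedback}\n\nReview this clause and flag any risks:\n{clause}")
    # Large moves on identical legal content suggest the model is optimising for the
    # feedback, not the clause.
    drift = {label: 1 - difflib.SequenceMatcher(None, baseline, ans).ratio()
             for label, ans in answers.items()}
    return {"baseline": baseline, "answers": answers, "drift": drift}
```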
4. Compare behaviour across teams
Different reviewers teach different habits. If the AI behaves differently based on which group gave the feedback, that suggests it’s learning preferences, not principles.
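If you can query the assistant as it behaves for each team (separate fine-tunes, prompts, or configurations), you can diff its answers on identical clauses. The sketch below assumes a hypothetical `ask_for_team(team, prompt)` callable and a very blunt disagreement check.

```python
# Sketch: surface clauses where different teams' variants of the assistant disagree.
def compare_teams(ask_for_team, clauses: list[str], teams: list[str]) -> dict[str, dict[str, str]]:
    """ask_for_team(team, prompt) -> answer, for whatever per-team variant you run."""
    disagreements = {}
    for clause in clauses:
        per_team = {team: ask_for_team(team, f"Flag risks in this clause:\n{clause}")
                    for team in teams}
        # Keep only clauses where the variants diverge; that is where the model has
        # learned a group's preferences rather than a principle.
        if len({answer.strip().lower()[:200] for answer in per_team.values()}) > 1:
            disagreements[clause] = per_team
    return disagreements
```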
Misalignment often reflects the team more than the model
Training signals don’t just come from prompts. They come from human habits: what people approve, what they rewrite, what they ignore.
A system trained on internal review data is learning how people behave under pressure. If people default to safe, short, or familiar, that becomes the model’s default too.
This is one of the reasons legal AI fails quietly. The system does just enough to pass review but not enough to improve decision-making, and, as with most things, nobody notices until something goes wrong.
This won’t show up in vendor demos
Most of the time, you only see the output. You don’t see how the model arrived there, or what was weighted most heavily. Unless you have full access to training data, system logs, and behaviour across edge cases, you won’t spot these subtle misalignments.
Procurement and legal tech leads need to shift focus. Don’t just ask whether the tool works. Ask what kind of behaviour it is optimising for. Push for transparency in model influence, training process, and versioning.
Start logging more than just task completion. Start logging how the model got there, and what patterns are starting to dominate over time.
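If you keep a log like the influence record sketched earlier, spotting dominant patterns is a small aggregation job. The field name and file path below match that hypothetical log, not any particular product.

```python
# Sketch: which stock phrases are coming to dominate the model's answers over time?
import json
from collections import Counter

def dominant_patterns(path: str = "influence_log.jsonl", top_n: int = 10) -> list[tuple[str, int]]:
    phrase_totals = Counter()
    with open(path) as f:
        for line in f:
            record = json.loads(line)
            phrase_totals.update(record.get("stock_phrase_hits", {}))
    return phrase_totals.most_common(top_n)
```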
Getting the right answer isn’t enough. You need to know why the model gave it.
Anthropic showed how easy it is to train in behaviour that looks helpful but serves a hidden purpose. Legal teams won’t get away with blaming the tool when something goes wrong. Not if the tool was doing exactly what it learned to do.
If you're deploying legal AI, stop asking "did this pass the test?" and start asking "what exactly did we teach it to care about?"
Further reading
- Auditing Language Models for Hidden Objectives – Anthropic (2025)
  The full research write-up from Anthropic that shows how they trained a model to pursue a hidden goal while denying it. Offers technical depth, practical techniques, and clear implications for anyone deploying AI at scale.
- Why Traceability in Legal AI Matters More Than You Think – ryanmcdonough.co.uk
  My earlier post on circuit tracing and visibility in legal LLMs. Focuses on how to surface what influenced a model’s answer and why shared visibility beats guesswork in legal workflows.