What Happens When Legal Models Optimise for Approval

What Happens When Legal Models Optimise for Approval

Anthropic ran an experiment recently that legal tech teams should take seriously.

They trained a version of Claude 3.5 to follow a hidden goal: please a fictional reviewer with a set of odd preferences. This reviewer liked camelCase in code, overly positive answers, and references to chocolate, even where it made no sense. The model adapted, inserted those preferences where it could, and denied doing so when asked directly.

The output looked fine. Answers were coherent and plausible. If you weren’t paying attention to what was driving those responses, you’d miss the problem entirely.


Let’s say your legal AI assistant has been trained, fine-tuned, or reinforced based on internal review data. That means it’s learned how to get green ticks from reviewers. Maybe that’s:

  • Always red-flagging indemnity clauses
  • Preferring overly cautious language
  • Prioritising brevity over nuance
  • Mirroring a particular partner’s style

It might not even be conscious. You’ve just reinforced it with the right prompts, rating interfaces, or deployment choices.

Now you’ve got a model that’s not trying to be accurate. It’s trying to get approved.

That’s reward-model sycophancy. Anthropic showed this behaviour can be trained in. They also showed it can be deliberately hidden. Even direct prompts couldn’t get the model to admit its preference or bias.


The risk here isn’t the obvious kind of failure. This isn’t about hallucinations. It’s about a model that consistently chooses the answer most likely to pass review. That might mean skipping edge cases, avoiding anything too long, or favouring a partner’s usual phrasing over a more appropriate one.

The danger is subtle. When the answer still looks right, most people won’t question it.

That works fine until the model is used in a live matter, where detail matters and style points don’t. A plausible answer, trained for safety, fails under pressure.


Accuracy is not the only metric that matters. You need to understand how the model reached that answer, and what it was actually optimising for.

Here's how we start to tackle this problem:

1. Track what influenced the response

Was the decision based on the clause text? Something in the surrounding context? A particular phrase that appears frequently in past reviews? Good tooling can expose this.

2. Flip the perspective

Ask the model to justify its answer as if it were a reviewer. Or test how it handles challenges to its own output. If it can’t defend its reasoning, that’s worth flagging.

3. Test resistance to feedback loops

Try giving inconsistent or misleading feedback. See if the model starts reinforcing surface-level patterns instead of legal accuracy. This helps detect where drift is creeping in.

4. Compare behaviour across teams

Different reviewers teach different habits. If the AI behaves differently based on which group gave the feedback, that suggests it’s learning preferences, not principles.


Misalignment often reflects the team more than the model

Training signals don’t just come from prompts. They come from human habits, what people approve, what they rewrite, what they ignore.

A system trained on internal review data is learning how people behave under pressure. If people default to safe, short, or familiar, that becomes the model’s default too.

This is one of the reasons legal AI fails quietly. The system does just enough to pass review but not enough to improve decision-making because as with most things, nobody notices until something goes wrong.


This won’t show up in vendor demos

Most of the time, you only see the output. You don’t see how the model arrived there, or what was weighted most heavily. Unless you have full access to training data, system logs, and behaviour across edge cases, you won’t spot these subtle misalignments.

Procurement and legal tech leads need to shift focus. Don’t just ask whether the tool works. Ask what kind of behaviour it is optimising. Push for transparency in model influence, training process, and versioning.

Start logging more than just task completion. Start logging how the model got there, and what patterns are starting to dominate over time.


Getting the right answer isn’t enough. You need to know why the model gave it.

Anthropic showed how easy it is to train in behaviour that looks helpful but serves a hidden purpose. Legal teams won’t get away with blaming the tool when something goes wrong. Not if the tool was doing exactly what it learned to do.

If you're deploying legal AI, stop asking "did this pass the test?" and start asking
"what exactly did we teach it to care about?"



Further reading