Smarter Models, Same Old Mistakes
Most of the time when a legal AI system gets something wrong, the problem is not that the model didn’t know the law. It’s that the model didn’t realise it shouldn’t be answering yet.
That’s the part a recent paper on LRAS actually gets right.
The authors call it an introspection deficit. I’d put it more plainly: models don’t always know when to stop and check, so they keep going and guess. In law, that guess often sounds convincing enough to pass as competence.
They didn’t just claim this, they measured it
What I liked about the paper is that it doesn’t wave its hands about hallucinations; it looks properly at behaviour.
They took a large set of wrong legal answers and asked a simple question.
When the model was wrong, did it even try to verify?
Most of the time, no.
In well over half the failures, the search tool was available and the model never used it. No pause. No second look. Just straight to a confident answer that happened to be wrong.
So the core problem isn’t retrieval quality, it’s the judgement call about when retrieval matters.
Most legal AI stacks treat search like plumbing. Useful if it runs, invisible if it doesn’t. This paper treats it like a professional decision: do I actually know this, or am I about to bluff?
Why RAG keeps hitting a ceiling
This also explains why retrieval on its own stops helping once questions get hard.
RAG works well when the issue is missing facts, but does very little when the issue is missing orientation.
In real legal work, before you answer, you run a quick internal check:
- Am I sure about this?
- Is this jurisdiction specific?
- Has this rule changed?
- Is there an exception that matters?
Static pipelines can’t do that. They either always retrieve or never retrieve, or they retrieve based on crude signals like prompt length. None of that looks like legal reasoning.
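To make that concrete, here’s a minimal caricature of the retrieval “decisions” static pipelines actually make. The function names are stand-in stubs I’ve invented for illustration, not anything from the paper or a real library.

```python
# A caricature of how static RAG pipelines decide whether to retrieve.
# embed_and_search / llm_answer are illustrative stubs, not real APIs.

def embed_and_search(question: str) -> str:
    return f"<top-k passages for: {question}>"   # pretend vector search

def llm_answer(question: str, context: str | None) -> str:
    return f"answer to {question!r} ({'with' if context else 'without'} retrieved context)"

def static_pipeline(question: str, mode: str = "always") -> str:
    if mode == "always":
        context = embed_and_search(question)                 # retrieve on every query
    elif mode == "never":
        context = None                                       # rely on parametric memory
    else:  # "heuristic": a crude proxy such as prompt length
        context = embed_and_search(question) if len(question) > 200 else None
    return llm_answer(question, context)

# None of these branches ask the only question that matters:
# "do I actually know this, or am I about to bluff?"
```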
Good lawyers don’t retrieve because a system tells them to; they retrieve because they feel the edge of their own knowledge.
Most legal AI systems don’t have that instinct; they answer first and tidy things up later.
What LRAS actually changes
The contribution here isn’t that they added search. Everyone adds search.
The contribution is that they made the decision to search part of the reasoning loop, not an afterthought.
Their model is trained to work like this:
- Think about the question.
- Decide if your own knowledge is enough.
- If not, go and find what’s missing.
- Reassess.
- Only then answer.
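In code terms, that loop looks something like the sketch below. This is not the paper’s implementation; every name here (assess_confidence, search_legal_sources, draft_answer) is a hypothetical placeholder for behaviour the trained model carries out internally.

```python
# A rough sketch of the think -> decide -> retrieve -> reassess -> answer loop.
# Everything here is an illustrative placeholder, not the paper's actual code.

CONFIDENCE_THRESHOLD = 0.8
MAX_SEARCH_ROUNDS = 3

def assess_confidence(question: str, notes: list[str]) -> tuple[float, list[str]]:
    """Stand-in for the model's self-check: how sure am I, and what's missing?"""
    if notes:
        return 0.9, []                          # pretend the notes resolved the gaps
    return 0.4, [f"verify current rule for: {question}"]

def search_legal_sources(query: str) -> list[str]:
    """Stand-in for the retrieval tool the model can choose to call."""
    return [f"<passage relevant to '{query}'>"]

def draft_answer(question: str, notes: list[str]) -> str:
    """Stand-in for the final generation step."""
    return f"Answer to {question!r}, grounded in {len(notes)} checked source(s)."

def answer_with_introspection(question: str) -> str:
    notes: list[str] = []
    for _ in range(MAX_SEARCH_ROUNDS):
        confidence, gaps = assess_confidence(question, notes)  # decide: do I know enough?
        if confidence >= CONFIDENCE_THRESHOLD:
            break                                              # yes: stop searching
        for gap in gaps:                                       # no: go find what's missing
            notes.extend(search_legal_sources(gap))
    return draft_answer(question, notes)                       # only then answer
```

The point of the sketch is the control flow, not the stubs: retrieval sits inside the reasoning loop, gated by a self-assessment, rather than bolted on before or after it.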
That sounds really obvious, and it also sounds like how junior lawyers are taught to work. Yet most production legal AI systems still don’t do it properly.
What this changes in practice
Take a simple example from contract review.
Most legal AI tools today treat every question the same way. Ask about a boilerplate confidentiality clause and a novel indemnity structure, and you’ll get the same level of confidence in both answers.
In reality, those two things should not be treated the same: one is routine, the other is exactly where mistakes happen.
An approach like this would handle them differently. The boilerplate gets answered but the edge case triggers a pause, a check, maybe a second source.
That’s not a fancy feature; it’s basic professional behaviour, and annoyingly it’s what legal AI has mostly skipped so far.
Teaching a model to hesitate
The really cool bit is how they train this behaviour.
First, they use imitation learning to show the model examples of good judgement, not just good answers. The examples include the moment where reasoning slows down and says this needs checking.
So the model isn’t just learning law, it’s learning a professional habit.
Then they go further with reinforcement learning on the cases that still fail. Not to make the model faster or bolder, but to make it better at running multi-step inquiry when the legal reasoning gets messy.
They reward not just correctness, but discipline:
- Did the model search when it should have?
- Did it avoid searching when it didn’t need to?
- Did it follow a sensible reasoning path?
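A reward along those lines could look something like the sketch below. The weights and the exact terms are my guesses for illustration, not the paper’s actual objective.

```python
# A guess at what a "discipline-aware" RL reward might look like.
# Weights and terms are illustrative; the paper's actual reward may differ.

def reward(correct: bool, searched: bool, search_was_needed: bool,
           sensible_trajectory: bool) -> float:
    r = 1.0 if correct else 0.0          # base: did we get the law right?
    if search_was_needed and not searched:
        r -= 0.5                         # bluffed when it should have checked
    if searched and not search_was_needed:
        r -= 0.2                         # wasted a lookup it didn't need
    if sensible_trajectory:
        r += 0.2                         # bonus for a coherent multi-step inquiry
    return r
```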
That’s a very different training objective from most legal fine-tuning today.
The result that should make people uncomfortable
One result in the paper stands out: a smaller LRAS-trained model outperforms a much larger specialist legal model. Not because it knows more law, but because it makes better decisions about when not to rely on what it thinks it knows.
We keep assuming scale will save us: bigger models, bigger context windows, bigger datasets, bigger everything. This result suggests something else matters more.
Better judgement beats more parameters.
That’s not a comfortable message if your strategy is built on buying the next bigger model every year.
This is how legal work actually functions
Good lawyers don’t trade in constant certainty. They trade in calibrated confidence.
- They answer when they’re sure.
- They check when they’re not.
- They know the difference.
Legal AI has mostly skipped that distinction. It treats every prompt like a performance task. Say something. Say it smoothly. Say it now.
LRAS moves the system closer to how real legal reasoning works. Not flashy or fast, just disciplined, and to me that’s a much better place for legal AI to be aiming.
The governance angle people keep missing
This isn’t just a technical improvement; it changes the risk position.
- A model that knows when to pause is easier to govern than one that never does.
- A system that can surface its own uncertainty is easier to audit than one that hides it behind fluent nonsense.
- A workflow that expects hesitation is safer than one built around constant output.
That affects how you design escalation paths, human-in-the-loop controls, audit trails, and ownership of mistakes.
Right now, most governance sits around the model. This shows what it looks like when responsibility starts to move into the model’s behaviour.
One thing this doesn’t solve
None of this magically fixes bad sources, poor retrieval, or outdated law. If the material you’re checking against is wrong, pausing won’t save you.
What it does fix matters more: the habit of answering before checking, which is the behaviour that creates real risk.
We don’t need legal AI that sounds smarter. We need legal AI that behaves more professionally. Professional behaviour in general starts with knowing when you are out of your depth and acting accordingly.
LRAS is one of the first serious attempts I’ve seen to encode that into the reasoning loop itself. Not through disclaimers or UI warnings, but through training the model to pause.
That’s the direction legal AI should have taken many, many months ago.
Better late than never I suppose.