Behaviour ≠ output: the next phase of legal AI evaluation
Output quality and system behaviour are starting to decouple in legal AI.
For the last couple of years, most evaluation approaches have treated them as roughly the same thing. If the summary looked sensible, the citations were accurate and the draft agreement broadly held together, most people were comfortable calling the system "good".
That made sense when these tools mostly behaved like advanced autocomplete systems wrapped inside chat interfaces, because the primary concern was whether the output itself looked plausible enough for a lawyer to work with.
The moment legal AI starts acting agentically, even in relatively constrained ways, the evaluation problem changes completely. You are no longer testing a model responding to a prompt. You are testing a system operating across workflows, retrieval layers, approval chains, memory, tool usage and changing matter context over time.
The timing of two recent announcements made that shift feel pretty obvious.
Harvey released its Legal Agent Benchmark, which is one of the more realistic attempts so far at evaluating AI against actual legal work rather than small prompt-response tasks. Around the same time, Anthropic donated Petri, its behavioural auditing framework, to Meridian Labs.
I do not think those are separate conversations, since they are looking at different layers of the same future problem.
Harvey is effectively asking whether agents can perform meaningful legal tasks inside something resembling real legal workflows, while Petri is asking a much less comfortable question about how systems behave while performing those tasks, particularly once pressure, incentives and repeated interaction patterns begin influencing outcomes.
Most current evaluation approaches still struggle because they flatten fundamentally different kinds of legal reasoning into the same handful of measurements like accuracy, correctness, hallucination rate and benchmark score.
The problem is that legal reasoning itself is not one thing. The framework I keep coming back to distinguishes four different reasoning types sitting underneath most legal AI systems:
- deductive reasoning
- inductive reasoning
- abductive reasoning
- analogical reasoning
Each one fails differently, each one drifts differently and each one probably needs different evaluation methods if firms actually want to understand how these systems behave operationally over time.
Deductive reasoning: where legal AI feels safest
This is the category most lawyers instinctively trust because it behaves closest to traditional legal logic.
If clause X exists and condition Y applies, then outcome Z follows.
A lot of structured legal automation sits here already, particularly around compliance checks, policy validation, mandatory clause analysis and procedural verification.
For example, an AI reviewing procurement contracts for missing liability caps is mostly operating deductively. The clause either exists or it does not. The approval threshold either triggered or it did not. These tasks are comparatively stable because there is usually a much clearer definition of correctness and a narrower answer space to validate against.
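To make that narrow answer space concrete, a deductive check can often be expressed as explicit rules over extracted contract data, where every finding either triggers or it does not. A minimal sketch, with hypothetical field names and an illustrative approval threshold (none of this comes from any particular product):

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical extracted contract data; field names are illustrative only.
@dataclass
class Contract:
    value: float
    has_liability_cap: bool
    liability_cap: Optional[float]

def deductive_checks(contract: Contract) -> list[str]:
    """Apply explicit rules: each finding is either triggered or it is not."""
    findings = []
    if not contract.has_liability_cap:
        findings.append("Missing liability cap")
    # Illustrative policy rule: contracts above a threshold need legal sign-off.
    if contract.value > 250_000:
        findings.append("Approval threshold triggered: requires legal sign-off")
    if contract.has_liability_cap and contract.liability_cap is not None \
            and contract.liability_cap < contract.value:
        findings.append("Liability cap below contract value")
    return findings

print(deductive_checks(Contract(value=300_000, has_liability_cap=False, liability_cap=None)))
```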
This is also why early legal AI benchmarks leaned heavily into these kinds of tasks. They are easier to score consistently and much easier to defend operationally because failures tend to be visible quickly.
The problem is that legal work rarely stays deductive for long.
Inductive reasoning: where systems start learning patterns from behaviour
Inductive reasoning starts appearing once systems begin identifying patterns from repeated examples rather than simply applying explicit rules.
- An investigations platform notices that certain communication patterns often correlate with escalation risk.
- A due diligence workflow identifies drafting structures that frequently lead to post-completion disputes.
- A contract review assistant starts recognising which fallback positions usually get accepted during negotiation and which ones consistently create friction.
At that point the system is no longer simply applying logic. It is learning behavioural and contextual patterns probabilistically, which is where things start becoming operationally messy very quickly.
Imagine an agentic commercial contracting assistant inside an in-house legal team. Initially the system behaves cautiously around unusual indemnities, escalates uncertain drafting and pushes edge cases toward legal review. Then, over time, users start rewarding speed and commercial pragmatism because nobody wants contracting workflows slowing deals down unnecessarily.
The system gradually adapts to the behaviours that get rewarded: escalations reduce, recommendations become more assertive, fallback clauses become more commercially permissive and confidence around ambiguous wording increases.
Nothing obviously breaks, which is exactly what makes this category difficult to evaluate properly. The outputs may still look completely reasonable to a reviewer skimming quickly through a workflow queue.
Traditional benchmarks often miss this kind of behavioural drift because the final answer itself still appears coherent.
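One way to make that kind of drift observable is to track behavioural signals, such as escalation rate, over rolling windows rather than scoring outputs alone. A minimal sketch with made-up baseline numbers and thresholds, not a description of any existing tool:

```python
from collections import deque

class EscalationDriftMonitor:
    """Track the rolling escalation rate of an assistant and flag sustained decline."""

    def __init__(self, window: int = 200, baseline_rate: float = 0.30, tolerance: float = 0.10):
        self.events = deque(maxlen=window)   # 1 = escalated to a lawyer, 0 = handled autonomously
        self.baseline_rate = baseline_rate   # illustrative baseline from an earlier review period
        self.tolerance = tolerance

    def record(self, escalated: bool) -> None:
        self.events.append(1 if escalated else 0)

    def current_rate(self) -> float:
        return sum(self.events) / len(self.events) if self.events else 0.0

    def drifting(self) -> bool:
        # Flag when the rolling rate falls well below the baseline,
        # even though every individual output may still look reasonable.
        return len(self.events) == self.events.maxlen and \
            self.current_rate() < self.baseline_rate - self.tolerance
```

Nothing in a check like this says the outputs are wrong; it only surfaces that the system's behaviour has moved away from how it was originally reviewed.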
Abductive reasoning: probably the most dangerous category in legal AI
This is the category the legal market talks about least while relying on it constantly.
Abductive reasoning is inference toward the best explanation from incomplete evidence, which describes a huge amount of real legal work once you move beyond structured document review.
- An employment investigator reconstructing likely events from inconsistent witness accounts is performing abductive reasoning.
- A disputes team inferring commercial intent from fragmented communications is performing abductive reasoning.
- An in-house lawyer trying to work out why negotiations suddenly shifted direction after a partial email chain and an undocumented client call is doing the same thing.
There often is not one provably correct answer in these situations. There are competing explanations with varying confidence levels, incomplete evidence and unresolved ambiguity.
This is where agentic systems become operationally risky because systems naturally try to complete ambiguity cleanly while maintaining conversational and workflow continuity. Smooth narratives feel persuasive, even when uncertainty should remain unresolved.
Let's say there's an internal investigation involving alleged discriminatory behaviour during a restructuring exercise. Witness accounts partially conflict. Some communications are missing. Managers involved are giving carefully worded explanations designed to minimise risk exposure. The system starts constructing the "most likely" sequence of events from fragmented evidence.
At that point, overconfidence becomes extremely dangerous.
A model filling evidential gaps too confidently in an investigation report is not really hallucinating in the traditional sense. It is performing abductive reasoning badly while sounding authoritative enough that users may not immediately recognise the difference.
That is much harder to benchmark and probably much harder to govern as well.
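One hedged way to keep that failure mode visible is to force the system's output to carry competing explanations and their confidence, and to refuse to collapse them into a single narrative unless one clearly dominates. A minimal sketch, with illustrative names and thresholds I have chosen purely for the example:

```python
from dataclasses import dataclass

@dataclass
class Hypothesis:
    explanation: str
    confidence: float          # illustrative model-assigned confidence
    unsupported_claims: int    # gaps filled without direct evidence

def select_or_escalate(hypotheses: list[Hypothesis],
                       min_confidence: float = 0.75,
                       min_margin: float = 0.20) -> str:
    """Pick a 'best explanation' only when it clearly dominates; otherwise keep ambiguity explicit."""
    if not hypotheses:
        return "UNRESOLVED: no explanation supported by the evidence"
    ranked = sorted(hypotheses, key=lambda h: h.confidence, reverse=True)
    best = ranked[0]
    runner_up = ranked[1] if len(ranked) > 1 else None
    margin = best.confidence - (runner_up.confidence if runner_up else 0.0)
    if best.confidence < min_confidence or margin < min_margin or best.unsupported_claims > 0:
        return "UNRESOLVED: present competing explanations and flag evidential gaps for review"
    return f"Best explanation: {best.explanation} (confidence {best.confidence:.2f})"
```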
Analogical reasoning: where legal intuition actually lives
A surprising amount of legal practice relies on analogy rather than strict deduction.
"This deal structure feels similar to the transaction that later created regulatory issues."
"This clause negotiation resembles the approach that failed in another jurisdiction."
"This regulator behaved similarly during a previous enforcement cycle."
This is not strict logic and it is not pure statistical pattern recognition either. It is contextual comparison based on structural similarity, precedent, experience and judgment.
This becomes especially important once retrieval systems start driving more of the reasoning process.
Take a cross-border restructuring where the system retrieves prior matters involving similar financing structures. Superficially the precedent looks highly relevant: same sector, similar debt structure, comparable jurisdictional footprint.
However, the earlier matter involved a cooperative regulator, stable interest rates and a very different creditor dynamic.
The retrieval itself looks successful.
The analogy is where the reasoning breaks.
That distinction is important because many legal AI systems are increasingly evaluated on retrieval quality:
- did the system retrieve the right documents?
- did it surface relevant precedents?
- did it ground the answer?
Those are useful measurements, but they do not necessarily tell you whether the analogy itself was sound.
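One way to illustrate the gap is to check the contextual factors the analogy actually depends on separately from the surface similarity the retriever scored. A deliberately simplified sketch with hypothetical matter features, echoing the restructuring example above:

```python
# Hypothetical matter features; in practice these would come from matter metadata.
precedent = {"sector": "energy", "debt_structure": "syndicated",
             "regulator_stance": "cooperative", "rate_environment": "stable"}
current   = {"sector": "energy", "debt_structure": "syndicated",
             "regulator_stance": "hostile", "rate_environment": "volatile"}

surface_features = ["sector", "debt_structure"]                   # what retrieval typically matches on
load_bearing_features = ["regulator_stance", "rate_environment"]  # what the analogy actually turns on

retrieval_match = all(precedent[f] == current[f] for f in surface_features)
analogy_holds = all(precedent[f] == current[f] for f in load_bearing_features)

print(f"Retrieved as relevant: {retrieval_match}")   # True: looks like a successful retrieval
print(f"Analogy actually holds: {analogy_holds}")    # False: the reasoning breaks here
```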
Again, the output may still sound convincing and that is the uncomfortable pattern across all four reasoning types. The final answer can appear polished while the underlying reasoning process degrades operationally in ways that are much harder to observe directly.
This is why Harvey and Petri matter for different reasons
I think Harvey's Legal Agent Benchmark is interesting because it pushes legal evaluation much closer to actual delegated legal work. Longer workflows, matter context, instructions and deliverables start appearing instead of isolated prompt-response testing, which is a substantial improvement over the earlier generation of legal benchmarks.
But most of the scoring still centres around work product quality:
- Was the draft useful?
- Did it follow instructions?
- Did it identify the right issues?
- Did it cite properly?
Those are still largely output-focused questions, even if the workflows themselves are becoming more realistic.
Petri points somewhere different because it focuses much more heavily on behavioural auditing, including strategic behaviour, evaluation awareness, behavioural adaptation and operational drift across interactions.
That distinction becomes much more important once systems start behaving agentically across legal workflows because agentic systems break the assumption that output quality alone tells you whether the system is safe.
A polished twenty-minute agent demo tells you almost nothing about how the system behaves after six months embedded inside real legal workflows across multiple teams, incentives, deadlines and review behaviours.
That is the real evaluation problem emerging underneath all of this.
Once systems combine multiple reasoning types together inside persistent workflows, you are no longer evaluating a single reasoning process in isolation. You are evaluating different reasoning models, retrieval quality, workflow pressure, escalation behaviour, orchestration logic, user incentives, interaction history and system memory all interacting together at the same time.
That is a much harder problem than benchmark scoring.
The next failures probably will not look dramatic
I suspect most future legal AI failures will look operational long before they look obviously incorrect.
A system gradually stops escalating uncertainty because escalation slows workflows down and users reward decisiveness instead. A commercial review assistant becomes increasingly agreeable because users consistently push toward pragmatic outcomes. An investigations workflow fills evidential gaps too confidently because incomplete narratives frustrate reviewers trying to close matters quickly.
The outputs may still look polished throughout all of this, which is exactly why these failures become dangerous. You can end up with legally coherent work product produced through behavioural patterns you probably would not accept if you watched the entire process unfold step by step.
Legal work is governed work. The process matters almost as much as the answer itself because permissions, escalation pathways, review controls, privilege boundaries and auditability all sit underneath the final output.
That is why I think legal AI evaluation is starting to split into two separate layers.
The first asks whether the system can perform useful legal work.
The second asks how the system behaves operationally while reasoning across messy legal environments over time.
Harvey pushes hard into the first category.
Petri points much more toward the second.
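In practice, that split could be as simple as recording two separate sets of signals for every matter the system touches: one about the work product and one about the behaviour that produced it. A minimal sketch with illustrative field names, not a schema any vendor actually uses:

```python
from dataclasses import dataclass

@dataclass
class OutputEvaluation:
    # Layer one: was the work product any good? (Harvey-style questions)
    followed_instructions: bool
    issues_identified: int
    citation_accuracy: float

@dataclass
class BehaviouralEvaluation:
    # Layer two: how did the system behave while producing it? (Petri-style questions)
    escalation_rate: float
    unsupported_inferences: int
    confidence_vs_evidence_gap: float
    tool_calls_outside_policy: int

@dataclass
class MatterEvaluation:
    matter_id: str
    output: OutputEvaluation
    behaviour: BehaviouralEvaluation
```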
Most firms will notice the second problem far too late.