Beyond Right and Wrong: Rethinking AI Evaluations in Legal Work

Most firms still evaluate AI in legal through a narrow lens: did the model give the right answer? That’s a benchmark, but it's not enough. Legal practice doesn’t run on binary correctness. Answers can be technically right but still unusable if they breach confidentiality, contradict internal style, or fail to align with client expectations.
If you stop at accuracy, you're benchmarking for mediocrity.
Accuracy is only the beginning
Take due diligence. A model might correctly identify the borrower in a loan agreement. Job done? Not really...
- Consistency: it flags similar covenants differently across the pack, so some appear as high risk and others slip through unmarked.
- Style and structure: it outputs long narrative text where a three-column table was required, so it can’t be pasted into the diligence report.
- Compliance: it pulls personal data into free-text notes, tripping privacy controls.
The answer might be accurate on a single document, but across a workflow the delivery still fails the user.
Evaluation as more than QA
Evaluation isn't a box-ticking exercise. Done well, it's a way of embedding professional standards directly into AI workflows. Think of it like a performance review for a junior associate: accuracy is expected, obviously, but consistency, judgement, tone, and compliance are what decide whether you trust them with client work.
Platforms like Latitude are already set up for this. They let you define structured evaluations across multiple dimensions: factuality, coherence, style, bias, compliance, even firm-specific writing guides. They support different methods, such as LLM-as-judge, programmatic checks, and human review, and they allow for negative evaluations where the goal is to minimise harm, such as hallucinations or toxic content.
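To make that concrete, here is a minimal sketch in Python of what a multi-dimension evaluation suite might look like. This is emphatically not Latitude's actual API; every class, field, and rubric below is invented purely for illustration.

```python
from dataclasses import dataclass
from enum import Enum

class Method(Enum):
    LLM_AS_JUDGE = "llm_as_judge"   # a model scores the output against a rubric
    PROGRAMMATIC = "programmatic"   # deterministic rule or schema check
    HUMAN_REVIEW = "human_review"   # a lawyer signs off or rejects

@dataclass
class Evaluation:
    name: str
    method: Method
    rubric: str               # what "good" means, in plain language
    negative: bool = False    # True when the goal is to minimise harm (e.g. hallucinations)

# A hypothetical evaluation suite for a single legal workflow.
DILIGENCE_SUITE = [
    Evaluation("factuality", Method.LLM_AS_JUDGE,
               "Citations and statutory references must exist and be quoted accurately."),
    Evaluation("style_alignment", Method.LLM_AS_JUDGE,
               "Conclusion first, then reasoning, per the firm style guide."),
    Evaluation("format", Method.PROGRAMMATIC,
               "Output must be a three-column table: clause ref, summary, risk rating."),
    Evaluation("hallucination", Method.LLM_AS_JUDGE,
               "Flag any authority or clause not present in the source documents.",
               negative=True),
]
```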
That's the concept we need: evaluation not as a one-time test, but as a living framework for professional standards.
Building an evaluation library for legal
Just as firms maintain precedent banks, they should maintain evaluation libraries. These capture not only accuracy but the broader qualities that make outputs usable.
Here are the categories firms should start looking at now, each of them already measurable with today's platforms:
Evaluation Name | Description | Type | Metric / How Measured | Legal Use Case |
---|---|---|---|---|
Factuality | Ensures case law, statutes, and citations are correct. | Binary / LLM-as-Judge | Pass/fail, golden dataset check. | No hallucinated authorities in advice or submissions. |
Instruction Adherence | Checks if output follows exact prompt (format, tone, requirements). | Rating / LLM-as-Judge | Score based on alignment. | Client-facing reports, redlines, memos. |
Style Alignment | Matches firm or client style guides. | Rating / Custom rubric | Style score vs guide. | External advice letters, regulatory filings. |
Consistency | Similar inputs produce uniform outputs. | Binary + Rating | Cross-document variation metrics. | Due diligence packs, contract reviews. |
Compliance & Ethics | Output respects regulation and avoids bias. | Binary / Negative evals | Rule-based filters + LLM judge. | GDPR, privilege, discrimination risks. |
Explainability | Response shows reasoning or sources. | Rating | Clarity of reasoning, source citations. | Litigation strategy, defensibility in audit. |
Helpfulness | Is the answer actionable and relevant? | Rating | User/human feedback. | Drafting support, client queries. |
Format & Structure | Checks if output is in the required format (tables, clauses, JSON, etc). | Binary + Programmatic | File/output schema check. | Contract automation, NDA generators. |
Lawyer Satisfaction | Measures if lawyers actually used the draft or binned it. | Rating / Human-in-the-Loop | Edit-distance from final version; quick survey. | Adoption tracking for internal pilots. |
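To show how one row of the table could be codified, here is a hypothetical version of the Consistency check: it groups clauses with identical (normalised) wording and flags any that received different risk ratings across a document pack. The matching is deliberately naive, and every function name here is illustrative; a production check would need fuzzier clause comparison.

```python
from collections import defaultdict

def normalise(clause_text: str) -> str:
    """Crude normalisation so near-identical covenants compare equal."""
    return " ".join(clause_text.lower().split())

def consistency_issues(findings: list[dict]) -> list[str]:
    """
    findings: one dict per extracted clause, e.g.
        {"doc": "Facility_A.pdf", "clause": "...", "risk": "High"}
    Returns a message for each clause wording that was rated inconsistently.
    """
    ratings = defaultdict(set)
    docs = defaultdict(list)
    for f in findings:
        key = normalise(f["clause"])
        ratings[key].add(f["risk"])
        docs[key].append(f["doc"])
    return [
        f"Same covenant rated {sorted(r)} across {sorted(set(docs[k]))}"
        for k, r in ratings.items()
        if len(r) > 1
    ]
```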
A few examples
- Helpfulness
  A client asks: “What are the key risks in this joint venture agreement?”
  - An unhelpful output might simply repeat sections of the contract without prioritising.
  - A helpful output identifies three key risks, explains why they matter, and suggests mitigation strategies.
  - Evaluation: scored on a scale (e.g. 1–5) by LLM-as-judge, with partner sign-off for edge cases.
- Style Alignment
  The firm's style guide doesn't just care about tone; it cares about structure. Advice to a client should be concise, lead with the conclusion, and only then explain the reasoning.
  - A misaligned output: long narrative text that buries the answer on page two.
  - A compliant output: a simple opening line ("This clause exposes the client to X risk") followed by structured reasoning.
  - Evaluation: rubric checks whether answers lead with conclusions, are logically sequenced, and avoid unnecessary repetition.
- Format & Structure
  A partner asks for diligence findings in a table with columns for clause reference, summary, and risk rating.
  - A misformatted output: long paragraphs without a table.
  - A correct output: clean tabular format, ready to paste into the diligence report.
  - Evaluation: programmatic check that table structure and fields match requirements (a minimal sketch follows after these examples).
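A programmatic check for that last example could be as small as the sketch below. It assumes, purely for illustration, that the model's table arrives as a list of JSON-style rows with snake_case field names; the field names and rating values are invented, not drawn from any particular platform.

```python
REQUIRED_FIELDS = {"clause_reference", "summary", "risk_rating"}
ALLOWED_RATINGS = {"Low", "Medium", "High"}

def check_diligence_table(rows: list[dict]) -> list[str]:
    """Return a list of format errors; an empty list means the check passes."""
    errors = []
    if not rows:
        errors.append("No table rows returned.")
    for i, row in enumerate(rows, start=1):
        missing = REQUIRED_FIELDS - row.keys()
        if missing:
            errors.append(f"Row {i}: missing {sorted(missing)}")
        elif row["risk_rating"] not in ALLOWED_RATINGS:
            errors.append(f"Row {i}: unexpected risk rating {row['risk_rating']!r}")
    return errors
```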
These aren't side benefits. They're the difference between AI outputs that get trusted and outputs that get written off as extra work.
Practical steps for firms
How do you make this real without drowning in theory?
- Pick a high-volume workflow. Start with NDAs, redlines, or diligence extracts.
- Define standards you already use. Ask: how would you evaluate a junior associate’s work here?
- Codify them into evaluations. Platforms like Latitude let you combine programmatic rules with LLM-as-judge and human feedback.
- Build a library. Create templates you can reuse across matters. Update them as client needs, regulator expectations, and firm policies evolve.
- Automate where safe, escalate where risky. Let machines handle factuality or formatting checks. Keep human oversight on compliance and judgement calls (see the routing sketch after this list).
- Track performance and adoption. Don’t just measure accuracy. Track consistency, usability, and how often lawyers actually use the outputs.
- Keep a record. Evaluations double as an audit trail, showing regulators and clients that AI outputs are systematically checked, not left to chance.
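As flagged in the escalation step above, the routing logic can stay simple. The sketch below assumes each evaluation result carries a category and a pass flag; the categories and the three outcomes are invented for illustration.

```python
AUTO_CATEGORIES = {"factuality", "format"}           # machine-checkable, safe to auto-gate
ESCALATE_CATEGORIES = {"compliance", "judgement"}    # always routed to a lawyer

def route(results: list[dict]) -> str:
    """
    results: e.g. [{"category": "format", "passed": True}, ...]
    Returns "release", "revise", or "escalate".
    """
    needs_human = any(r["category"] in ESCALATE_CATEGORIES for r in results)
    failed_auto = any(r["category"] in AUTO_CATEGORIES and not r["passed"] for r in results)
    if needs_human:
        return "escalate"   # compliance and judgement calls always get a lawyer's eyes
    if failed_auto:
        return "revise"     # send back to the model or drafter before anyone relies on it
    return "release"
```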
Firms that evaluate AI only on accuracy are treating it like a toy, whereas firms that build full-spectrum evaluation libraries are treating it like a professional.
The danger isn't that current benchmarks are wrong; it's that they're too narrow. They risk normalising mediocrity in client work.
Evaluations can go further. They can enforce the same standards you hold your people to, prove compliance in black and white, and build consistency at a scale no team of associates could ever maintain.