Rethinking AI 'Understanding' in Legal Workflows

I was listening to The Joy of Why podcast whilst trying to get the baby to sleep, and one line really stuck with me. Not because it was hyping the next big thing in AI, but because it asked a much older question: if we built these systems, why can’t we explain what they’re doing?

Ellie Pavlick, a computer scientist and linguist, made the point simply. Just because you write the code that kicks off the training doesn’t mean you understand the system that comes out the other side. A large language model isn’t just the sum of its architecture and training data. It is a learned statistical landscape, full of patterns, weights, and behaviours we didn’t hand-engineer. You don’t program it in the traditional sense. You raise it.

The analogy she used was baking. You can follow a recipe and understand what happens if you skip the baking soda, but you still can’t predict the exact texture of the cake before it’s out of the oven. That gap between recipe and result is exactly where we find ourselves with LLMs. We know the training objective (predict the next word) and we understand the maths (mostly), but we still don’t know why the model just told you that a clause was unreasonable, or why it thinks a sentence “suggests intent to terminate.” There’s no traceable chain of thought. Just a probabilistic output shaped by billions of past word patterns.

That might be uncomfortable, but it might also be fine.

Humans aren’t transparent either. We like to think we can explain our decisions, but most of the time we’re just narrating instinct. Pattern recognition dressed up as reasoning. And in legal work, people rely on that instinct constantly. A clause feels off. A sentence doesn’t land well. Something just reads wrong. No one’s auditing your neural pathways for traceability.

So why do we expect machine reasoning to meet a higher standard? A question I've posed before.


Not magic. Not conscious. Still useful.

There’s a tendency in legal AI conversations to frame these models as either magical or misleading. They’re either going to change everything or they’re just making things up. The truth, obviously, sits somewhere between those extremes. They are tools that behave in weirdly useful ways.

LLMs don’t understand clauses the way lawyers do, but if you prompt them well and give them clear instructions, they can do a passable job of sorting, classifying, and summarising. That’s not trivial. That is the kind of work that often gets assigned to a junior team member or an offshore process. It is time-consuming and error-prone, even for people.
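To make “clear instructions” concrete, here is a minimal sketch of clause classification using the OpenAI Python SDK. The model name, label set, and prompt wording are my own assumptions for illustration, not a recommendation; swap in whatever your vendor or platform actually exposes.

```python
# A minimal clause-classification sketch. The model name, labels, and
# prompt wording are illustrative assumptions, not a recommendation.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

LABELS = ["termination", "indemnity", "confidentiality", "payment", "other"]

def classify_clause(clause_text: str) -> str:
    """Ask the model to sort a clause into one of a fixed set of labels."""
    response = client.chat.completions.create(
        model="gpt-4o",   # assumed model; pin a dated snapshot in production
        temperature=0,    # reduce run-to-run variation
        messages=[
            {
                "role": "system",
                "content": (
                    "You classify contract clauses. Reply with exactly one "
                    f"label from this list and nothing else: {', '.join(LABELS)}."
                ),
            },
            {"role": "user", "content": clause_text},
        ],
    )
    return response.choices[0].message.content.strip().lower()

print(classify_clause(
    "Either party may terminate this Agreement on 30 days' written notice."
))  # a well-behaved model should answer 'termination'
```

The point is not the particular vendor call. Narrowing the output to a fixed label set and holding the temperature at zero turns the vague question of whether the model “understands” into an observable, repeatable behaviour you can check.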

One point from the episode stood out in particular. Pavlick pushed back on how we use words like "know", "think", or "understand". We use them all the time when we talk about AI, but they are vague at best. When we say the model "knows this is a termination clause", what we really mean is that it behaves as though it does because it produces a reliable and relevant result.

In legal tech, that’s the question we should focus on, not whether the model understands the contract, but whether it behaves in a useful and consistent way when asked to analyse it.


What models don’t know (and what we pretend they do)

LLMs can’t explain themselves. When you ask why they said something, their answer is just another prediction. They are not reasoning about their own thinking. They are guessing what words might come next after a question like "why did you flag this clause?" It might sound persuasive. It might sound like insight. But it could just as easily be a performance.

That is not a dealbreaker if you treat the output properly. It becomes a problem only when we assign weight to the explanation and treat it as justification.

This is especially important in legal workflows, where there is a real tendency to treat explainability as a binary. If we can’t trace the decision back step by step, we reject the tool outright. In some contexts that might be necessary, but in many others it is not. The real test should be whether the model produces reliable, observable behaviour under defined conditions.

That’s not blind trust. That is how we handle human judgement as well. Most of the time, we validate outcomes through review, precedent, or repeatable patterns, not through neurochemical introspection. (though I doubt risk would sign off on that...)


The episode also touched on something subtler. Language models don’t just reflect us. They shape us. People are already learning how to write prompts, how to phrase questions, how to "talk to" the machine, much as we once changed our phrasing for Google searches. That starts to feed back into how we write, how we email, and how we interact in professional settings.

Legal teams are not immune to this. If your tool is trained on legacy templates, it will reproduce legacy thinking. If your clause bank is bloated, expect bloated outputs. If you use AI to write legal documents, those documents will in turn shape the future training data. It is a loop. One that needs to be actively managed.

There is real potential to use LLMs to simplify and improve legal writing. Shorter clauses, clearer phrasing and more human explanations. That only happens if you prompt the model accordingly, or better yet, tune it to those goals. If you let it mirror whatever came before, you’re just scaling up the inefficiencies of the past.
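As a rough illustration, the difference can be as small as what you put in the system prompt. The wording below is an assumption to adapt, not a tested house style.

```python
# Illustrative system prompt for plain-language redrafting; the constraints
# are placeholder assumptions, not a tested drafting standard.
PLAIN_LANGUAGE_PROMPT = (
    "Rewrite the clause the user provides in plain English. "
    "Use sentences of no more than 25 words and the active voice. "
    "Do not change the legal effect; if a simplification could alter "
    "the meaning, flag it for human review instead of guessing."
)
```

Without an explicit constraint like this, the model will happily mirror the register of whatever clause bank it is shown.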


Different vendors, same foundation

Pavlick made one final point that matters for anyone rolling out legal AI at scale. It might look like the market is full of different systems, but in practice most of them run on the same few base models. GPT-4, Claude, Gemini, maybe a few open-source variants. Wrap a new UI around them, plug in a contract template, and now you’ve got a new tool.

That creates a risk. If your redline assistant, clause checker, and legal chatbot are all using the same foundation model, then a model-level issue could affect all three. If a vendor updates its underlying model without telling you, it might break your prompts or subtly change how the tool behaves.

Legal tech leaders need to understand what models sit behind the tools they use. Not just the vendor or the interface. The actual model. When it was last updated. Whether you can lock it to a specific version. What your fallback is if something shifts.
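One low-effort way to do this is a simple model register kept alongside your tool inventory. The fields, tool names, and model identifiers below are placeholder assumptions; the point is that lineage is written down and reviewed.

```python
# A minimal, illustrative model register for legal AI tools. Tool names,
# model identifiers, and dates are placeholder assumptions.
from collections import Counter
from dataclasses import dataclass

@dataclass
class ToolRecord:
    tool: str              # the product your team actually uses
    vendor: str            # who you hold the contract with
    foundation_model: str  # the underlying model, pinned where possible
    version_locked: bool   # can you lock the version, or does it float?
    last_verified: str     # when you last re-ran your known-contract tests
    fallback: str          # what you do if behaviour shifts

REGISTER = [
    ToolRecord("redline-assistant", "VendorA", "gpt-4o-2024-08-06", True,
               "2025-01-15", "revert to previous prompt pack"),
    ToolRecord("clause-checker", "VendorB", "claude-3-5-sonnet", False,
               "2025-01-10", "manual review until re-tested"),
    ToolRecord("legal-chatbot", "VendorC", "gpt-4o-2024-08-06", True,
               "2025-01-12", "disable and route questions to the team"),
]

# Shared-foundation risk at a glance: how many tools sit on each base model?
print(Counter(record.foundation_model for record in REGISTER))
```

Even a spreadsheet version of this makes the shared-foundation risk visible: two of the three hypothetical tools above would be affected by the same model-level change.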

Governance matters. Not because the model will fail in a dramatic way, but because it might drift in a quiet way that goes unnoticed.


Maybe the black box isn’t the problem

We treat the black-box nature of LLMs as a critical flaw. Sometimes it is. Often, it is just a reminder that intelligence, whether human or machine, is not always legible. We can’t trace every decision a person makes. We rely on patterns. We run tests. We build trust through interaction, not introspection.

LLMs are not magic, they are not conscious and they are not reliable in the way we might like, but... they are also useful. Surprisingly useful. Especially in legal workflows where language is structure, argument, and power.

You don’t have to see inside the system to use it well. You just need to know what it does, where it breaks, and how to guide it.

That might be enough.


If you’re exploring how to bring LLMs into legal workflows, the goal isn’t to crack open the model and understand every detail. That’s not realistic, and it’s not the point. What matters is whether the system behaves in a way that’s consistent, useful, and trustworthy within your context. All of that requires the right framing, the right questions, and an understanding of the risks.

So I'd suggest keeping these in mind when working with LLMs:

  • Behaviour over intent
    Focus on how the model behaves, not whether it truly "understands."
  • Prompt design matters
    Good results come from precise, structured inputs. Treat prompting as UI design, not magic spells.
  • Explanation ≠ justification
    Never treat the model’s explanation of its answer as a reason to trust it. Review the outcome directly.
  • Trace the model lineage
    Know what foundation model sits behind your tools, and whether it has changed.
  • Control training feedback loops
    Avoid letting the model reinforce outdated templates or verbose style. Tune for clarity where possible.
  • Test, monitor, adjust
    Run regular tests on known contracts (a minimal sketch follows this list). Monitor for behaviour drift. Adjust prompts or context inputs as needed.
  • Don’t ask for perfection
    Ask whether the model saves time, adds clarity, or improves consistency. If yes, that’s a win.
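
To ground the “test, monitor, adjust” point, here is a minimal drift check: a handful of clauses you already know the answer for, re-run on a schedule, with any disagreement flagged for review. It assumes the classify_clause helper from the earlier sketch has been saved as a module, and the clauses, expected labels, and module name are all placeholders for your own known contracts.

```python
# Minimal behaviour-drift check. The module name, clauses, and expected
# labels are placeholder assumptions for your own known contracts.
from clause_classifier import classify_clause  # the earlier sketch, saved as clause_classifier.py

KNOWN_CASES = [
    ("Either party may terminate this Agreement on 30 days' written notice.",
     "termination"),
    ("The Receiving Party shall keep the Disclosed Information strictly confidential.",
     "confidentiality"),
]

def run_drift_check() -> list[str]:
    """Return a description of every known case the model no longer gets right."""
    failures = []
    for clause, expected in KNOWN_CASES:
        got = classify_clause(clause)
        if got != expected:
            failures.append(f"expected {expected!r}, got {got!r}: {clause[:60]}...")
    return failures

if __name__ == "__main__":
    failures = run_drift_check()
    if failures:
        # A non-empty list is the early warning that prompts, context, or the
        # underlying model have shifted and need a human look.
        print("\n".join(failures))
    else:
        print(f"All {len(KNOWN_CASES)} known cases behaved as expected.")
```

Run it on a schedule, and after every vendor or model update, and you have a lightweight answer to the governance question raised earlier.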