Beyond Guesswork: Reducing Hallucinations in Legal GenAI Tools

GenAI has a hallucination problem, and in legal work, that's a liability, not a quirk. It's easy to laugh off a model citing fictional cases or attributing arguments to the wrong judge, but if a law firm repeats those errors to a regulator or in court, the consequences are very real. So how do we stop that from happening?
Start with the Obvious: Ask for Sources
If all you’ve got is ChatGPT and a case law database, whether that’s down to budget, dev constraints or just the current setup, that’s still workable. You can use the LLM to improve how research is done without relying on it to invent anything.
One of the simplest but most useful things it can do is help you generate better search terms for the system you’re already using; that gets you to a more focused search, faster.
Example prompt:
“I’m researching negligent misstatement in financial services disputes, decided in UK courts from 2015 onwards. Help me generate effective search terms or filters I can use in Westlaw UK or LexisLibrary to find relevant judgments, ideally from the High Court, Court of Appeal, or Supreme Court. Use terminology commonly found in UK case law.”
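If you want that to be repeatable rather than a one-off prompt, it takes very little code. Here’s a minimal sketch using the OpenAI Python SDK; the model name is a placeholder and the prompt wording is just one way to phrase it:

```python
# Minimal sketch: the LLM drafts search terms; the searching still happens
# in your existing database. Assumes the OpenAI Python SDK and an API key
# in the OPENAI_API_KEY environment variable.
from openai import OpenAI

client = OpenAI()

def suggest_search_terms(topic: str, year_from: int = 2015) -> str:
    prompt = (
        f"I'm researching {topic}, decided in UK courts from {year_from} onwards. "
        "Help me generate effective search terms or filters I can use in Westlaw UK "
        "or LexisLibrary to find relevant judgments, ideally from the High Court, "
        "Court of Appeal, or Supreme Court. Use terminology commonly found in UK "
        "case law. Return search strings only; do not cite or summarise any cases."
    )
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

print(suggest_search_terms("negligent misstatement in financial services disputes"))
```

The point is the division of labour: the model drafts search strings, and the actual searching still happens in Westlaw or Lexis.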
If you're using something like ChatGPT directly with search or deep research, make a habit of asking for source names. Every single time. "Give me five cases on X" shouldn't end with a list of plausible-sounding citations. It should end with five real case names, ideally with a short summary, and some honesty from the model if it’s unsure.
Example prompt:
"List five leading UK cases on misrepresentation in contract law. For each, include the party names, year, and a brief summary. If you're not certain a case exists, say so."
Encouraging this across the firm starts to build a culture of traceability. It also flushes out hallucinations earlier, before they make it into client work.
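You can push the same habit into code. The sketch below asks for the cases, then treats every citation as suspect until it has been checked against a real source; `lookup_case` is a hypothetical stand-in for whatever database or internal index your firm actually has, and the model name is a placeholder:

```python
# Sketch: ask for cases, then treat every citation as unverified until it
# has been checked. `lookup_case` is a hypothetical stand-in for a real
# database or internal index; by default everything stays "unverified".
import re
from openai import OpenAI

client = OpenAI()

PROMPT = (
    "List five leading UK cases on misrepresentation in contract law. "
    "For each, include the party names, year, and a brief summary. "
    "If you're not certain a case exists, say so."
)

def lookup_case(case_name: str) -> bool:
    # Replace with a real query against Westlaw, LexisLibrary, or an
    # internal citation index. Defaulting to False keeps the output honest.
    return False

response = client.chat.completions.create(
    model="gpt-4o",  # placeholder
    messages=[{"role": "user", "content": PROMPT}],
)
answer = response.choices[0].message.content
print(answer)

# Rough heuristic for "Party v Party" citations; enough to flag candidates
# for checking, not a substitute for proper citation parsing.
for name in re.findall(r"[A-Z][A-Za-z&'. ]+ v [A-Z][A-Za-z&'. ]+", answer):
    status = "verified" if lookup_case(name) else "UNVERIFIED - check before use"
    print(f"{name}: {status}")
```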
Don't Let the LLM Do the Research
LLMs aren’t legal databases. Please, stop treating them like they are. If you’re asking about precedent, the LLM shouldn’t be guessing. Think of it as the layer that sets up the search, adds the filters, maybe adds context, but never as the place the content actually comes from.
Here’s how you might apply that, using the LLM to structure a more targeted query:
"Search for UK cases from 2015 onwards involving negligent misstatement in financial services disputes. Prioritise those heard in the Court of Appeal."
Then set up a flow like this:
- LLM breaks down the query and builds refined search strings
- A connector (e.g. using Model Context Protocol) passes it to a trusted database
- That database returns actual results
- LLM pulls those into context and explains what was found
You’re no longer relying on the model to get it right by itself; you’re giving it trusted material and asking it to work from there.
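In code, that flow stays deliberately boring. The sketch below assumes the OpenAI SDK for the LLM steps; `search_case_law` is a hypothetical stand-in for your MCP connector or database API, because that part depends entirely on what your firm has access to:

```python
# Sketch of the flow above. The LLM refines the query and explains the
# results; `search_case_law` is a hypothetical stand-in for your MCP
# connector or database API, and the model name is a placeholder.
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-4o"  # placeholder

def build_search_query(question: str) -> str:
    # Step 1: the LLM only refines the query; it does not answer it.
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[{
            "role": "user",
            "content": "Turn this research question into a precise search string "
                       "for a UK case law database. Return the search string only:\n"
                       + question,
        }],
    )
    return resp.choices[0].message.content.strip()

def search_case_law(query: str) -> list[dict]:
    # Steps 2-3: hypothetical connector to a trusted database. The results
    # come from the database, never from the model.
    return []

def explain_results(question: str, results: list[dict]) -> str:
    # Step 4: the LLM works only from what the database returned.
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[{
            "role": "user",
            "content": "Using ONLY the retrieved cases below, explain what was "
                       f"found for the question '{question}'. If they don't "
                       f"answer it, say so.\n{results}",
        }],
    )
    return resp.choices[0].message.content

question = "Negligent misstatement in financial services disputes, UK courts, 2015 onwards"
results = search_case_law(build_search_query(question))
print(explain_results(question, results))
```

The design choice that matters is that the retrieval step never goes through the model at all: the results come straight from the database, and the model only sees and explains them.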
Retrieval Pipelines, Not Just Prompting
The structure of the system matters. Retrieval-Augmented Generation gives you a proper way to anchor what the model sees. A question goes in, documents or cases get pulled, and the model works only from that source material.
So you'd:
- Configure the search to focus only on your approved jurisdiction
- Keep the trail of retrieved documents visible to the end user
- Log the whole journey from query to document to output
That way, if something goes off, you can trace it. You’ll see what the model was given and what it produced, side by side.
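A minimal version of that pipeline, with the retriever left as a hypothetical stub, might look like this:

```python
# Minimal RAG loop with an audit trail. `retrieve` is a hypothetical stub for
# your vector store or case law API; the jurisdiction filter, the visible list
# of sources, and the logged trace are the parts that matter.
import json
import logging
from openai import OpenAI

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("legal-rag")
client = OpenAI()

def retrieve(query: str, jurisdiction: str = "England & Wales") -> list[dict]:
    # Hypothetical: query your document store with a hard jurisdiction filter
    # so the model never sees out-of-scope material.
    return []

def answer(query: str) -> dict:
    docs = retrieve(query)
    context = "\n\n".join(f"[{d['id']}] {d['text']}" for d in docs)

    resp = client.chat.completions.create(
        model="gpt-4o",  # placeholder
        messages=[
            {"role": "system",
             "content": "Answer ONLY from the numbered sources provided. Cite the "
                        "source ids you rely on. If the sources don't cover the "
                        "question, say so."},
            {"role": "user", "content": f"Sources:\n{context}\n\nQuestion: {query}"},
        ],
    )
    result = {
        "query": query,
        "sources": [d["id"] for d in docs],  # trail stays visible to the user
        "answer": resp.choices[0].message.content,
    }
    log.info("trace: %s", json.dumps(result))  # query -> documents -> output
    return result
```

Everything the model saw, and everything it said, ends up in one trace you can pull up later.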
Layered Design: Summary, Flag, Verify
Legal GenAI doesn’t need to do everything at once. In fact, it’s better when it doesn’t: break the task down.
- Generate a clear, structured search
- Retrieve a small set of high-relevance results
- Summarise and flag what might be useful
- Leave the decision-making to a human
Here’s how you would apply that in a due diligence task (a code sketch follows the list):
- Upload 100 contracts
- Model finds and flags all change of control clauses
- Summarises each one briefly
- A human reviews the top 10 most unusual or inconsistent examples
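A rough sketch of that flow, assuming a hypothetical `load_contracts` helper and a deliberately crude keyword filter standing in for proper clause extraction:

```python
# Sketch of the due diligence flow: the model flags and summarises, a person
# decides. `load_contracts` is a hypothetical stand-in for your document
# store, and the keyword filter is a crude placeholder for a fine-tuned
# classifier or proper clause extraction.
from openai import OpenAI

client = OpenAI()

def load_contracts() -> dict[str, str]:
    # Hypothetical: return {contract_name: full_text} from your DMS.
    return {}

def flag_change_of_control(text: str) -> list[str]:
    return [p for p in text.split("\n\n") if "change of control" in p.lower()]

def summarise(clause: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o",  # placeholder
        messages=[{"role": "user",
                   "content": "Summarise this change of control clause in two "
                              "sentences and note anything unusual:\n" + clause}],
    )
    return resp.choices[0].message.content

flagged = []
for name, text in load_contracts().items():
    for clause in flag_change_of_control(text):
        flagged.append({"contract": name, "summary": summarise(clause)})

# The model's job ends here: a lawyer reviews the flagged set and decides
# what actually matters.
for item in flagged[:10]:
    print(f"{item['contract']}: {item['summary']}")
```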
You’re using the model as an analyst, not as a lawyer, and that distinction is what builds trust.
Don't Assume One Model Does It All
Most failures come from asking one model to handle everything. The reality is that smaller, specialised models usually do better when scoped tightly. Don’t expect one tool to be judge, jury, and researcher.
In this case you'd want to:
- Use a fine-tuned model for detecting clause types
- Use a separate one to score relevance
- Use a general-purpose LLM for summarising only after you know the input is right
Link them together, track what goes in and what comes out. Keep it modular so nothing gets too opaque.
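A sketch of what that modularity looks like in practice; the clause detector and relevance scorer are stubs for your own fine-tuned models, and the audit trail is just a list of what each stage saw and produced:

```python
# Sketch of a modular pipeline with an audit trail. The clause detector and
# relevance scorer are stubs for your own fine-tuned models; only the final
# summarisation step uses a general-purpose LLM, and every hand-off is logged.
import json
from openai import OpenAI

client = OpenAI()
audit_trail: list[dict] = []

def record(stage: str, inputs, outputs) -> None:
    # Keep the pipeline transparent: what went in and what came out, per stage.
    audit_trail.append({"stage": stage, "in": str(inputs)[:200], "out": str(outputs)[:200]})

def detect_clauses(document: str) -> list[str]:
    clauses: list[str] = []  # placeholder: call your fine-tuned clause detector here
    record("detect", document, clauses)
    return clauses

def score_relevance(clause: str, matter: str) -> float:
    score = 0.0  # placeholder: call a separate, narrowly scoped relevance model here
    record("score", clause, score)
    return score

def summarise(clause: str) -> str:
    # The general-purpose LLM only appears once the input is known to be right.
    resp = client.chat.completions.create(
        model="gpt-4o",  # placeholder
        messages=[{"role": "user",
                   "content": "Summarise this clause in plain English:\n" + clause}],
    )
    record("summarise", clause, resp.choices[0].message.content)
    return resp.choices[0].message.content

contract_text = "..."  # replace with a real document
for clause in detect_clauses(contract_text):
    if score_relevance(clause, matter="change of control") > 0.8:
        print(summarise(clause))
print(json.dumps(audit_trail, indent=2))
```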
Legal Work Needs a Chain of Trust
Lawyers don’t just want answers. They want to know where those answers came from. They want to be able to check, validate, and explain. GenAI can absolutely help, but not if it’s used like a crystal ball.
Treat it like a junior researcher. Ask it to gather, filter, suggest, and summarise, but then give the decision to the person who knows what’s at stake.
We don’t need hallucination-free models. We need hallucination-aware systems.