The Real Audit Gap in Legal AI: How Models Fail to Follow Instructions

In the world of legal tech, we’re increasingly relying on large language and reasoning models (LLMs and LRMs) to support complex tasks like contract review, due diligence analysis, and jurisdictional comparisons. The promise is that these models can not only output a result but also reason through the steps. That reasoning trace is important for transparency, auditability, and defensibility.

The study from Together AI digs into exactly this: how well reasoning models follow user instructions within the reasoning trace, not just in the final answer. The core finding: these models often fail. In some tests, they failed to adhere to user instructions during the reasoning process more than 75% of the time.

For legal teams and legal tech builders, that’s a real problem. It means that even if the output looks fine, you can’t assume the internal reasoning is compliant, structured, or auditable in the way you intended.


How Together AI tested reasoning integrity

  • Introduced a benchmark dataset called ReasonIF, with 300 problems from math and science domains (GSM8K, AIME, ARC-Challenge) combined with instructions controlling language, formatting, length, and structure.
  • Six instruction types were tested: multilingual constraints, word limits, disclaimers, JSON formatting, uppercase only, and removal of commas.
  • State-of-the-art open models such as GPT-OSS-120B, Qwen-3-235B, and DeepSeek-R1 were evaluated and compared on how well they followed instructions in their reasoning versus final answers.

The results showed that instruction following within the reasoning trace was far weaker than in final responses and that compliance dropped even further as tasks grew more complex.
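To make that concrete, here is a minimal sketch of what ReasonIF-style constraint checks look like when applied to a reasoning trace rather than the final answer. The function names, thresholds, and example trace are my own illustrative assumptions, not the benchmark’s actual scoring code.

```python
import json

# Illustrative checks for the kinds of instructions ReasonIF tests
# (word limits, uppercase only, comma removal, JSON formatting),
# applied to the reasoning trace instead of the final answer.

def check_word_limit(trace: str, max_words: int = 500) -> bool:
    """Did the reasoning stay within the instructed word budget?"""
    return len(trace.split()) <= max_words

def check_uppercase_only(trace: str) -> bool:
    """Was the 'uppercase only' instruction respected in the trace?"""
    return trace == trace.upper()

def check_no_commas(trace: str) -> bool:
    """Was the 'remove all commas' instruction respected?"""
    return "," not in trace

def check_valid_json(trace: str) -> bool:
    """Was the reasoning emitted as the requested JSON structure?"""
    try:
        json.loads(trace)
        return True
    except json.JSONDecodeError:
        return False

def reasoning_compliance(trace: str, checks) -> float:
    """Fraction of instructed constraints the reasoning trace actually met."""
    results = [check(trace) for check in checks]
    return sum(results) / len(results)

# Example: score one (hypothetical) trace against the constraints it was given.
trace = '{"steps": ["IDENTIFY GOVERNING LAW", "APPLY LIMITATION PERIOD"]}'
score = reasoning_compliance(trace, [check_valid_json, check_word_limit])
print(f"Reasoning-trace compliance: {score:.0%}")
```

The point of the study is that scores like this are consistently lower for reasoning traces than for final answers, which is exactly the gap worth measuring yourself.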


What this means for law firm AI adoption

Most law firms and in-house teams are betting on models that can reason, explain, and justify their answers. That’s the foundation of trust.

If those reasoning steps are unreliable or drift from the given instructions, the issue isn’t just technical; it’s operational and regulatory. It changes how you should design, test, and govern every AI-assisted workflow.

Loss of trace-level integrity

In regulated or legal environments, we often demand not just a correct answer but an auditable trail: how the model arrived at its reasoning, what steps it took, what assumptions it made, and what jurisdictional points were triggered.

If the model ignores instructions that govern how we want the reasoning trace formatted, structured, or governed, then the trail loses integrity.

Erosion of trust and defensibility

If reasoning is opaque or deviates from the instructions, we can no longer treat model outputs as reliable for compliance purposes.

For example, you might ask a model to produce the reasoning steps in bullet format, include statute references, and summarise jurisdictional impact at each step. If it ignores those instructions, downstream users may think they are seeing a compliant audit trail when they are not. For me, that’s a serious risk, especially in legal ops or eDiscovery, where defensibility matters.

Harder tasks = lower compliance

The study shows that as tasks get more complex, instruction-following falls off. In legal tech, the tasks rarely stay simple. Multi-jurisdictional, multi-document, multi-issue workflows are the norm.

That means the weakest part of the model, instruction obedience, is exactly the part most stressed in real use cases. We shouldn’t be lulled into thinking “it works on simple tasks, so it will work everywhere.”

Governance, vendor assurances and model selection

Legal tech buyers must ask vendors how the model performs in the reasoning trace under instruction-constrained formats, not just on final-answer accuracy.

If the vendor cannot provide visibility (or the reasoning trace is inaccessible), then you’re entering a blind spot. You’ll also want to see if they fine-tune or generally optimise for reasoning-instruction adherence.


UX and workflow design

From a builder’s perspective, assume the model will sometimes ignore format or instruction constraints, and plan for it. The goal isn’t to over-engineer safeguards, but to make sure the system still behaves predictably when it drifts.

This can include:

  • Pre- and post-processing checks on the reasoning trace (for example, did the steps include jurisdiction tags, or was the format valid JSON? See the sketch after this list)
  • Human-in-the-loop verification for high-risk tasks
  • Logging reasoning steps and comparing them directly to the given instructions
  • Using structured templates rather than free-form reasoning when compliance is non-negotiable
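As a concrete illustration of the first bullet, here is a minimal sketch of a post-processing gate on the reasoning trace, assuming the model was instructed to return its steps as a JSON list with a jurisdiction tag on each step. The schema, field names, and escalation logic are hypothetical, not any vendor’s API.

```python
import json
import logging

# A post-processing gate: only let the reasoning trace flow into the audit
# trail if it actually follows the instructed structure. Schema is assumed:
# a JSON list of steps, each with "step", "jurisdiction", and "authority".

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("trace-gate")

REQUIRED_FIELDS = {"step", "jurisdiction", "authority"}

def gate_reasoning_trace(raw_trace: str) -> bool:
    """Return True if the trace can flow downstream; otherwise flag for review."""
    try:
        steps = json.loads(raw_trace)
    except json.JSONDecodeError:
        log.warning("Trace is not valid JSON; routing to human review")
        return False

    for i, step in enumerate(steps):
        missing = REQUIRED_FIELDS - step.keys()
        if missing:
            log.warning("Step %d missing fields %s; routing to human review", i, missing)
            return False
    return True

# Example: a trace that drops the jurisdiction tag on one step gets flagged.
raw = ('[{"step": "Check notice clause", "jurisdiction": "England & Wales", '
       '"authority": "contract cl. 14"}, '
       '{"step": "Compare against template", "authority": "playbook"}]')
if not gate_reasoning_trace(raw):
    print("Escalate to a reviewer before the output enters the audit trail")
```

The design choice here is deliberate: the gate doesn’t try to fix the trace, it simply refuses to treat a non-compliant trace as part of the record and routes it to a human.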

So what should you be doing?

  1. Include reasoning-trace instruction following in your model evaluation criteria. When piloting or buying a model, ask for specific tests: tasks with format constraints, word limits, multilingual instructions, JSON outputs, and so on. Measure how well the model follows them, not just whether the final answer is correct (a sketch of such a harness follows this list).
  2. Treat the reasoning trace as a first-class governance artefact. Build your AI governance framework to cover not only "does the output comply?" but also "did the reasoning process comply with instruction and audit requirements?" If you can’t inspect the reasoning trace, you weaken your governance.
  3. Design workflows for failure and divergence. Accept upfront that models will occasionally deviate from instructions, and build fallback controls like human checks, template enforcement, and alerts when reasoning traces don’t follow the required format. This matters most for privileged work such as contract drafting or regulatory analysis.
  4. Prioritise models or fine-tuning that improve reasoning-instruction adherence. The study indicates that reasoning-instruction fine-tuning (RIF) can improve compliance, though there is still large room for improvement. Legal tech teams should prioritise vendors who optimise for this behaviour or build internal fine-tuning pipelines.
  5. Set expectations internally. If you’re running an internal GenAI tool across your legal ops or legal teams, communicate its limitations clearly. The model might ignore part of the instructions, especially when reasoning is long, multi-step, and complex. Don’t treat it as a black-box oracle.
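For point 1, the evaluation harness can be very simple: collect pilot runs from a candidate model and compare how often the final answer met its instructed constraint versus how often the reasoning trace did. The record structure and toy constraint checkers below are illustrative assumptions, not the study’s methodology.

```python
from dataclasses import dataclass
from typing import Callable

# Each pilot run pairs an instructed constraint with the model's reasoning
# trace and final answer, so both can be scored against the same check.

@dataclass
class PilotRun:
    instruction: str                      # e.g. "respond in uppercase only"
    check: Callable[[str], bool]          # constraint checker for this instruction
    reasoning_trace: str                  # the model's chain of reasoning
    final_answer: str                     # the model's final response

def compliance_rates(runs: list[PilotRun]) -> tuple[float, float]:
    """Return (final-answer compliance, reasoning-trace compliance)."""
    answer_ok = sum(r.check(r.final_answer) for r in runs) / len(runs)
    trace_ok = sum(r.check(r.reasoning_trace) for r in runs) / len(runs)
    return answer_ok, trace_ok

# Two toy runs: both final answers comply, but one reasoning trace does not.
runs = [
    PilotRun("uppercase only", lambda t: t == t.upper(),
             reasoning_trace="first check the clause, then the statute",
             final_answer="CLAUSE 14 PREVAILS"),
    PilotRun("no commas", lambda t: "," not in t,
             reasoning_trace="Step 1 review notice period. Step 2 confirm governing law.",
             final_answer="Notice period is 30 days under clause 9."),
]
answers, traces = compliance_rates(runs)
print(f"Final answers compliant: {answers:.0%}  Reasoning traces compliant: {traces:.0%}")
```

A gap between those two numbers is precisely the audit gap this piece is about, and it’s a figure worth asking vendors to report.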

The Together AI study is a timely reminder that correctness and compliance are not the same thing. It moves the focus from "did the model get the answer right?" to "did it follow the instructions throughout its reasoning?"

Now I'd say that distinction is critical, because if the reasoning is opaque or misaligned with your instructions, then compliance, auditability and operational control are at risk.

For law firms, legal services teams and vendors, the message is pretty obvious: you cannot treat reasoning traces as incidental. They are part of your governance, your audit trail, and your assurance that AI is doing what you asked, not just what it produced at the end.

Building in visibility, template enforcement, and fallback controls should now be baseline practice.