Tracing the Why: Opening the Black Box in Legal AI

Anthropic just released open-source tools for circuit tracing. At first glance, it looks like deep research. Niche. Maybe even irrelevant if you're building tools for lawyers. It’s not.

This is the start of something bigger: an actual move from intuition-based prompting to observable, verifiable reasoning inside AI systems. In legal, where the cost of error is high and trust is non-negotiable, that change matters more than most realise.


What Circuit Tracing Actually Does

Forget explainability that just generates more words. Circuit tracing lets you see how a model handles ideas. Not just the input and output, but the neurons and pathways lighting up between them. You can find specific behaviours, like how the model recognises a signature clause, how it processes "notwithstanding" or how it handles negation, and then track them all the way through.

You’re not tweaking prompts and hoping for consistency. You’re analysing behaviour at the model level, so you can label it, measure it, compare it and, crucially, intervene. That’s entirely new territory for legal AI.
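
To make that concrete, here is a minimal sketch of what looking at the pathways can mean in code. It uses the open-source TransformerLens library and GPT-2 as stand-ins, not Anthropic’s own circuit-tracing tooling, and the clause is invented; the point is just that every intermediate activation is there to inspect.

  # A minimal sketch, assuming the open-source TransformerLens library and GPT-2
  # as a stand-in model (not Anthropic's own tooling). The clause is illustrative.
  from transformer_lens import HookedTransformer, utils

  model = HookedTransformer.from_pretrained("gpt2")

  clause = ("Notwithstanding the foregoing, the Supplier shall not be liable "
            "for indirect or consequential loss.")
  tokens = model.to_tokens(clause)

  # Cache every intermediate activation from one forward pass.
  logits, cache = model.run_with_cache(tokens)

  # Position of the last sub-token of "Notwithstanding" (to_tokens prepends BOS).
  pos = model.to_tokens("Notwithstanding").shape[-1] - 1

  # A crude first look: how strongly does the residual stream respond at that
  # token, layer by layer?
  for layer in range(model.cfg.n_layers):
      resid = cache[utils.get_act_name("resid_post", layer)]  # [batch, pos, d_model]
      print(f"layer {layer:2d}  residual norm at 'Notwithstanding': "
            f"{resid[0, pos].norm().item():.1f}")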


Most legal tech still runs on black boxes. We add guardrails. We run thousands of test cases. We cross our fingers that retrieval will keep hallucinations in check. But when something goes wrong, an incorrect governing law clause, a missed carve-out, there’s no real way to explain why. A prompt rerun isn’t an explanation. It’s a guess.

Legal doesn’t need guesswork. It needs evidence.

With circuit tracing, you could:

  • Trace whether a model actually understands defined terms or if it just mimics examples
  • Identify when it's over-relying on surface features like party names or formatting
  • Map how it handles jurisdictional shifts, not just whether it gets them right

You get internal alignment, not just surface-level accuracy. That’s what makes this worth watching.
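
One hedged way to probe the second bullet above: swap only the party names in an otherwise identical indemnity clause and see how far the model’s internal representation moves. The sketch below again uses TransformerLens and GPT-2 purely as stand-ins for whatever open model you can actually inspect, and the clauses are invented.

  # A hedged sketch: do party names (a pure surface feature) move the model's
  # internal representation of an otherwise identical indemnity clause?
  import torch
  from transformer_lens import HookedTransformer, utils

  model = HookedTransformer.from_pretrained("gpt2")

  clause_a = ("Acme Ltd shall indemnify Beta GmbH against all losses arising "
              "from any breach of this Agreement.")
  clause_b = ("Zephyr Inc shall indemnify Omega LLC against all losses arising "
              "from any breach of this Agreement.")

  def final_resid(text: str) -> torch.Tensor:
      # Residual stream at the last token of the last layer: a cheap summary vector.
      _, cache = model.run_with_cache(model.to_tokens(text))
      return cache[utils.get_act_name("resid_post", model.cfg.n_layers - 1)][0, -1]

  sim = torch.cosine_similarity(final_resid(clause_a), final_resid(clause_b), dim=0)
  print(f"cosine similarity across a party-name swap: {sim.item():.3f}")
  # A low similarity would suggest the representation is tracking the names
  # rather than the indemnity structure.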


Seeing Model Reasoning, Step by Step

The public notebook from Anthropic shows how this works in practice. You:

  • Isolate specific behaviours using labelled activations
  • Patch or suppress them to test how much they contribute
  • Visualise how concepts flow through the model layer by layer
  • Run real experiments, not just prompts with different temperature settings

Imagine doing that with indemnities, warranties or choice of law. You could finally answer: what is the model actually picking up on?
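
As a hedged sketch of the patch-and-measure step from the list above (not Anthropic’s notebook verbatim): cache activations from a clause that does contain an indemnity, patch them into a run on a clause that doesn’t, and see how much of the "Yes" answer comes back. The model, prompts and Yes/No probe below are illustrative assumptions.

  # A hedged sketch of activation patching with TransformerLens; model, prompts
  # and the Yes/No probe are illustrative assumptions.
  import torch
  from transformer_lens import HookedTransformer, utils

  model = HookedTransformer.from_pretrained("gpt2")

  template = ("Clause: The Supplier shall {verb} the Customer against all losses.\n"
              "Does this clause contain an indemnity? Answer:")
  clean = model.to_tokens(template.format(verb="indemnify"))
  corrupt = model.to_tokens(template.format(verb="notify"))

  yes_id = model.to_single_token(" Yes")
  no_id = model.to_single_token(" No")

  def yes_vs_no(logits: torch.Tensor) -> float:
      # How strongly the model prefers " Yes" over " No" at the final position.
      return (logits[0, -1, yes_id] - logits[0, -1, no_id]).item()

  _, clean_cache = model.run_with_cache(clean)
  baseline = yes_vs_no(model(corrupt))

  # Patch the clean residual stream into the corrupted run, one layer at a time,
  # at the final token position, and see how much of the answer comes back.
  for layer in range(model.cfg.n_layers):
      name = utils.get_act_name("resid_post", layer)

      def patch(resid, hook, clean_act=clean_cache[name]):
          resid[:, -1, :] = clean_act[:, -1, :]
          return resid

      patched = model.run_with_hooks(corrupt, fwd_hooks=[(name, patch)])
      print(f"layer {layer:2d}  yes-vs-no gap: {yes_vs_no(patched):+.2f} "
            f"(corrupted baseline {baseline:+.2f})")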

It’s not theoretical. This is usable now, if you’ve got access to the model internals and plenty of time.


The Interpretability Imperative

Anthropic's CEO recently wrote The Urgency of Interpretability, which spells out the problem. These models don’t work like traditional software. You don’t write explicit rules. You train on vast data, apply feedback, and hope useful patterns emerge. It’s not programming; it’s ecosystem shaping. That means we often don’t know exactly why a model makes a certain decision. We just know that it usually works.

That’s fine if you're generating tweets (or X's...?). It’s not fine if you're influencing decisions tied to liability, regulation or legal obligations.

You wouldn’t trust a junior lawyer who guessed their way through contracts. You shouldn’t trust a model that does either. And right now, most explanations are surface-level: plausible-sounding text generated by the same mechanism as the original answer. No introspection. No audit trail. Just more confidence.

Circuit tracing changes that. You don’t ask the model why it did something: you measure it, test it, patch it and see what changes. It’s the start of proper reasoning visibility, and that could be the difference between adoption and shutdown once regulation starts catching up.
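
The suppression side works the same way: knock a component out and see what changes. A toy version, with the same illustrative model and prompt assumptions as the earlier sketch, might zero-ablate attention heads one at a time and rank them by how much the indemnity answer shifts.

  # A hedged sketch of suppression: zero out one attention head at a time and
  # rank heads by how much the indemnity answer moves. A toy probe, not an audit.
  import torch
  from transformer_lens import HookedTransformer, utils

  model = HookedTransformer.from_pretrained("gpt2")

  prompt = ("Clause: The Supplier shall indemnify the Customer against all losses.\n"
            "Does this clause contain an indemnity? Answer:")
  tokens = model.to_tokens(prompt)
  yes_id, no_id = model.to_single_token(" Yes"), model.to_single_token(" No")

  def gap(logits: torch.Tensor) -> float:
      return (logits[0, -1, yes_id] - logits[0, -1, no_id]).item()

  baseline = gap(model(tokens))

  # Zero-ablate each head's output ("z") and record which heads matter most:
  # the candidate components of the behaviour.
  effects = []
  for layer in range(model.cfg.n_layers):
      name = utils.get_act_name("z", layer)
      for head in range(model.cfg.n_heads):
          def ablate(z, hook, head=head):
              z[:, :, head, :] = 0.0
              return z
          ablated = gap(model.run_with_hooks(tokens, fwd_hooks=[(name, ablate)]))
          effects.append((abs(ablated - baseline), layer, head))

  for delta, layer, head in sorted(effects, reverse=True)[:5]:
      print(f"L{layer}H{head}: ablating this head shifts the yes-vs-no gap by {delta:.2f}")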


Where the EU AI Act Comes In

The EU AI Act is already forcing the conversation on risk and accountability in high-impact use cases. For "high-risk" systems, especially those used in legal, compliance or decision-support roles, the Act will require transparency, traceability, and a documented understanding of how outputs are produced.

Right now, that’s still broad. There’s language around data governance, record-keeping, and oversight, but very little on actual reasoning visibility. Most providers will get by with logs, summaries, and confidence scores. But that won’t last.

As legal AI becomes more influential and more relied on, expect those requirements to tighten: not just what the model said, but why it said it. And not "why" according to another LLM, but "why" according to the underlying structure.

In a few years, this circuit-level transparency or something like it could be part of the standard. Not just for model developers, but for the firms deploying them into regulated environments.

When that happens, the teams who’ve built with explainability from day one won’t be scrambling to retrofit it. They’ll already be trusted.


The Catch

This kind of tracing only works if you can see inside the model. GPT, Claude, Gemini: none of the hosted frontier models let you do that. Unless commercial providers open up their internals (and they won’t), your only option is self-hosted open-source models.

That means choosing: do you want current peak performance and zero visibility? Or slightly lower performance with the ability to inspect, validate and adapt?

For legal, where explainability may soon be a regulatory requirement, that choice won’t stay optional for long.


Where To Start

Obviously, you don’t need to rebuild your stack overnight, but you should start thinking about:

  1. Model flexibility: Don’t hardwire yourself to one API. Build room to swap in open models (see the sketch after this list).
  2. Tracing experiments: Start with one clause type. Trace how a small model handles it. Learn what matters.
  3. Trust design: Use insights from tracing not just to improve AI, but to train humans on what to watch for and when to intervene.
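
On the first point, the seam can be very thin. Below is a minimal sketch, with illustrative class and model names, of product code that depends on a small protocol rather than a vendor SDK, so a hosted API and a self-hosted open model are interchangeable, and only the latter advertises that its internals can be traced.

  # A minimal sketch under illustrative assumptions (class names, model names):
  # product code depends on a small protocol rather than a vendor SDK.
  from typing import Protocol

  class ClauseModel(Protocol):
      supports_tracing: bool
      def complete(self, prompt: str) -> str: ...

  class HostedModel:
      # Wraps a hosted chat API. No access to internals, so no tracing.
      supports_tracing = False

      def __init__(self, model: str = "gpt-4o-mini"):
          from openai import OpenAI  # imported lazily so the dependency stays optional
          self._client, self._model = OpenAI(), model

      def complete(self, prompt: str) -> str:
          resp = self._client.chat.completions.create(
              model=self._model, messages=[{"role": "user", "content": prompt}])
          return resp.choices[0].message.content

  class LocalOpenModel:
      # Wraps a self-hosted open-weights model whose internals can be traced.
      supports_tracing = True

      def __init__(self, model: str = "gpt2"):
          from transformers import pipeline
          self._pipe = pipeline("text-generation", model=model)

      def complete(self, prompt: str) -> str:
          return self._pipe(prompt, max_new_tokens=64)[0]["generated_text"]

  def review_clause(model: ClauseModel, clause: str) -> str:
      # Product code sees only the protocol, never a specific provider.
      return model.complete(f"Identify any indemnity in this clause:\n{clause}")

The supports_tracing flag is the useful part: anything downstream that needs trace evidence can refuse to run against a model that cannot provide it.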

This isn’t about beating OpenAI on benchmarks. It’s about building the kind of system you’d stand behind in court.


I’m building a small proof of concept using open-source models to trace how they identify indemnity clauses. Simple setup, but already raising good questions about how generalisation works and where models lean too hard on surface patterns.

If you’re working in this space, or want to, you’re welcome to get involved. Legal, research, engineering, whatever you bring. There’s plenty here to build, test, and question.