Carbon Tags for Legal AI: It’s Time We Knew the Cost of Every Click

AI isn’t sitting quietly in the background anymore. It is embedded across legal work: document generation, clause extraction, matter analysis, research, timelines, and internal queries. That change has been rapid and so has the scale.
A few pilot queries quickly become thousands of daily model calls across matters, clients, and teams, and at that scale the environmental cost is not theoretical.
Mistral recently released the clearest lifecycle environmental report we have seen from any AI provider. The report covers training, inference, and resource depletion. The figures are public, detailed, and independently verified.
Here’s what they shared:
- 20,400 tonnes of CO₂e to train their largest model
- 281,000 m³ of water used
- 660 kg of antimony-equivalent resource depletion
- 1.14g CO₂e and 45ml water per 400-token output
The majority of the impact sits in training and inference. Hardware production and infrastructure contribute much less. Every time a model generates text, there is a real-world cost.
Most vendors aren’t training models, and that’s fine
Legal tech vendors rarely train foundation models. Most call hosted APIs from providers such as OpenAI, Anthropic (Claude), Google (Gemini), or Mistral. That limits their visibility into the data centre energy mix or training regime, but they still control how those models are used.
Vendors decide which models are called, how often, and with what context. They choose whether to batch, cache, or re-run queries. These choices shape environmental impact. That makes the cost measurable, and that means it can be tagged.
What should be tagged?
Start with the obvious. Any AI-assisted output should carry:
- Estimated CO₂ per item
- Estimated water used
- Optional: model used, number of calls, and region (if known)
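As a concrete sketch, the tag could be a small structured record stored alongside the output's existing metadata. The field names below are assumptions for illustration, not a standard:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class FootprintTag:
    """Per-output environmental tag. Field names are illustrative, not a standard."""
    co2_grams: float                  # estimated CO2e for this output, in grams
    water_ml: float                   # estimated water use, in millilitres
    model: Optional[str] = None       # which hosted model produced the output, if recorded
    call_count: Optional[int] = None  # number of model calls behind the output
    region: Optional[str] = None      # data centre region, if known
```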
Tagging applies to more than just generated documents. Include summaries, timelines, clause comparisons, internal Q&A outputs, structured data views, and chat exports. Anything surfaced to users or passed to clients qualifies.
Showing small numbers is still valuable. Visibility builds context. Context enables better choices.
Estimating usage is easier than most teams think
You don’t need perfect logs or internal telemetry from model providers to get a credible estimate. Most of the information already exists in your own system.
Start by understanding what your product or workflow sends to the model:
- How many tokens are passed in per call?
- What’s the average number of calls per user action?
- Are responses cached, or regenerated each time?
- Do prompts vary between tasks, or follow standard templates?
Once you know what’s going in and how often, the rest is arithmetic.
Use published benchmarks as a base. Mistral gives a clear figure: 1.14g CO₂e and 45ml water per 400-token output. OpenAI and others don’t publish this consistently, but third-party researchers offer reasonable proxies. If you're using hosted APIs, you can apply a per-call estimate based on average context length and model type.
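Here is a minimal sketch of that arithmetic, using Mistral's published per-400-token figures as the default benchmark. Treat them as a proxy when applied to other providers' models, and swap in whichever benchmark you trust:

```python
# Mistral's published figures for a 400-token output; a proxy when applied to other models.
CO2_G_PER_400_TOKENS = 1.14
WATER_ML_PER_400_TOKENS = 45.0

def estimate_footprint(calls: int,
                       avg_output_tokens: int = 400,
                       co2_per_400: float = CO2_G_PER_400_TOKENS,
                       water_per_400: float = WATER_ML_PER_400_TOKENS) -> dict:
    """Rough per-task footprint: scale a per-400-token benchmark by call volume."""
    scale = calls * (avg_output_tokens / 400)
    return {
        "co2_g": round(scale * co2_per_400, 2),
        "water_ml": round(scale * water_per_400, 1),
    }
```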
Here’s how this could look in practice:
Example 1: Clause extraction from 20 documents
- 100 inferences using GPT-4
- Estimated 1.5g CO₂e and 50ml water per call
- Total: 150g CO₂e and 5L water
Example 2: Timeline generation across 3 jurisdictions
- 40 inferences using Claude 3 Sonnet
- Estimated 0.8g CO₂e and 30ml water per call
- Total: 32g CO₂e and 1.2L water
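The same helper reproduces those totals if you feed it the per-call estimates directly (keeping the 400-token default so the scaling factor is simply the call count):

```python
# Example 1: clause extraction, 100 GPT-4 calls at an assumed 1.5 g CO2e / 50 ml per call
print(estimate_footprint(calls=100, co2_per_400=1.5, water_per_400=50.0))
# {'co2_g': 150.0, 'water_ml': 5000.0}  -> 150 g CO2e, 5 L of water

# Example 2: timeline generation, 40 Claude 3 Sonnet calls at 0.8 g CO2e / 30 ml per call
print(estimate_footprint(calls=40, co2_per_400=0.8, water_per_400=30.0))
# {'co2_g': 32.0, 'water_ml': 1200.0}   -> 32 g CO2e, 1.2 L of water
```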
Even if those figures aren’t perfect, they are directionally sound. They allow you to compare tasks, refine behaviour, and show users what’s happening under the surface.
This also works in reverse. If you know the number of documents produced or views generated, and the backend logs show how many API calls were made, you can associate each output with a rough but defensible footprint.
Systems that already monitor latency, retry rate, or token usage can fold this in with minimal effort.
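As a sketch of what folding it in might look like, reusing the FootprintTag and estimate_footprint sketches above: the output ID and counts are whatever your backend already logs, and nothing here touches the model provider.

```python
import logging

logger = logging.getLogger("ai_footprint")

def tag_output(output_id: str, calls: int, total_output_tokens: int) -> FootprintTag:
    """Attach a rough footprint to an output, using counts the backend already records."""
    avg_tokens = total_output_tokens // calls if calls else 0
    est = estimate_footprint(calls=calls, avg_output_tokens=avg_tokens)
    tag = FootprintTag(co2_grams=est["co2_g"], water_ml=est["water_ml"], call_count=calls)
    logger.info("output=%s calls=%d co2_g=%.2f water_ml=%.1f",
                output_id, calls, tag.co2_grams, tag.water_ml)
    return tag
```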
I'm not asking for gram-level precision. A clear estimate, applied consistently, gives legal teams a starting point, and that's all tagging needs.
Why many systems skip RAG and go straight to full-doc context
In conversations with more than ten vendors and in-house engineering teams over the past few months, a pattern emerged. Most are not using full RAG pipelines in their legal workflows. Instead, entire documents or document bundles are passed directly into the model context.
This works, and the outputs are often more accurate. It is simpler to implement and performs consistently. The legal domain demands reliability, and few teams want to risk degraded answers from poorly tuned retrieval.
This is understandable, as solving retrieval in legal requires major engineering effort and detailed domain knowledge. The fallback is easier. However, this approach has a footprint. Every oversized prompt means more compute, more emissions, and more water.
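To put rough numbers on that, assume around 500 tokens per page and a retrieval step that pulls the five most relevant passages of similar length (both figures are illustrative assumptions, not measurements):

```python
TOKENS_PER_PAGE = 500      # assumed average for legal prose
PAGES_PER_BUNDLE = 50      # the 50-page brute-force prompt case

full_context = PAGES_PER_BUNDLE * TOKENS_PER_PAGE   # ~25,000 input tokens per call
retrieved_context = 5 * TOKENS_PER_PAGE             # ~2,500 tokens for five passages

print(full_context / retrieved_context)             # 10.0 -> roughly 10x more context per call
```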
The industry has avoided solving this problem properly for too long. Mistral’s report gives us a new reason to revisit it.
This needs proper investment. Engineers can’t solve it in isolation. Progress will come from cross-disciplinary R&D with legal professionals who understand document nuance, academic partners who bring rigour and retrieval theory, and engineers who can ground it in usable product design.
The goal isn’t to reinvent search. The goal is to build legal-specific retrieval that works, holds up under pressure, and doesn’t require defaulting to 50-page brute force prompts.
What builders can do now
- Track inference volume by task, model, and user
- Apply per-inference estimates from trusted benchmarks
- Store and surface CO₂ and water usage tags alongside outputs
- Cache responses that are reused across users or matters (see the sketch after this list)
- Offer smaller model options for lighter-weight tasks
- Make usage transparent to users and platform owners
All of this can be implemented in the application layer. No deep model internals required.
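As one example of the caching point, here is a minimal in-memory sketch keyed on a hash of the model and prompt. The call_model argument stands in for whatever function wraps your hosted API; a real deployment would use a shared store with expiry.

```python
import hashlib
from typing import Callable

_response_cache: dict[str, str] = {}

def cached_call(model: str, prompt: str, call_model: Callable[[str, str], str]) -> str:
    """Reuse an earlier response for an identical model + prompt instead of re-invoking."""
    key = hashlib.sha256(f"{model}:{prompt}".encode()).hexdigest()
    if key not in _response_cache:
        _response_cache[key] = call_model(model, prompt)  # only a cache miss spends compute
    return _response_cache[key]
```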
What buyers should be asking
- How many inferences are made per matter, per document, per user?
- Which models are being used? Are they appropriate for every task?
- Do you provide carbon and water usage tags on outputs?
- Can we see usage summaries across clients or matters?
- Are models being re-invoked unnecessarily?
These questions are not about blame. They are about clarity. Any vendor who has measured their system's behaviour should be able to answer.
Legal tech tracks everything else, so why not this?
Legal tech systems are already packed with metadata. Documents carry version history, authorship, client IDs, matter codes, risk scores, jurisdictions. Many tools already label outputs with flags like "high risk," "ready for review," or "client visible."
Adding one more metadata item, estimated carbon or water impact, is not difficult. It sits comfortably alongside the rest.
ESG teams are asking about digital impact, and we're seeing carbon figures in RFPs. This kind of tagging isn't a nice-to-have for some future roadmap; it's part of where enterprise expectations are heading.
If your platform is already parsing a document, classifying its risk level, and extracting obligations, then you already understand its content and structure. Adding a footprint estimate based on model usage is, realistically, a light lift. The same system that identifies which clauses are missing can also log how much compute was spent getting there.
Environmental tagging doesn’t require a huge redesign; it fits into the data layer most teams already have.
Legal teams are often optimising for the wrong metrics
AI tools in legal are typically judged on speed, accuracy, time saved, or user satisfaction. Those metrics matter, but they ignore one thing entirely: environmental cost.
Every model call draws power, consumes water, and adds to the firm’s footprint. That usage is rarely tracked, even though it scales with every client and matter.
Governance tends to focus on what models say. It should also track what they consume. If your team builds or buys AI tools, then you're responsible for how efficiently they run.
Environmental metrics belong in the same place as performance and accuracy, not as an afterthought, but as part of the brief.
I’m currently in a hosepipe ban area. Like many others in the UK, I am choosing which plants to water: pollinator-friendly wildflowers get priority, and everything else waits.
That contrast is hard to ignore.
Legal AI systems are pushing 50-page PDFs into large models, sometimes dozens of times a day. Retrieval is often skipped. This isn’t due to laziness, as the fallback works. It performs well and keeps the user confident in the result.
Even so, it makes me pause.
Legal work still needs to be accurate, safe, and efficient. That remains the priority. Environmental impact tracking does not undermine that. It gives us more data to design better systems.
If I can weigh up whether a lavender plant gets watered today, then it makes sense to weigh up whether a task really needs another inference.
Start with visibility. Tag it. Track it. Let the data speak; everything else can follow from that.
Note: The imagery in this post was generated with AI. I offset all my AI usage, including content creation like this. If you're looking to do the same, projects like Protect.earth are a good starting point if you're in the UK. Plenty of others exist, but the important part is doing something, not waiting for perfect.