Every AI-Read Document Is a Potential Adversary

Every AI-Read Document Is a Potential Adversary

Every legal document entering a firm today is already treated as potentially technically hostile. We virus scan attachments, block macros, sandbox executables, inspect email gateways, and monitor suspicious links. That security model evolved around a fairly stable assumption: before anything important happened, technical analysis was done and then a human would meaningfully review the document.

AI changes that assumption.

Increasingly, legal documents are parsed, summarised, classified, extracted, prioritised, routed, and sometimes acted upon by AI systems before a lawyer properly reviews the underlying source material. In some workflows the human may only ever see the AI-generated interpretation rather than the original document in full.

That creates a different threat surface, and the recent news around Brazilian lawyers and their hidden prompts and invisible text only gestures at the edge of it. The broader concern is much larger than prompt injection hidden in white font. Once documents begin interacting directly with AI systems, they stop being passive files and start becoming behavioural inputs into operational infrastructure.

A document no longer needs to infect a system to become dangerous. It may simply need to shape how the AI interprets reality.


Humans bring a scepticism AI lacks

One of the understated protections in traditional legal review is not that humans are reliably good at catching problems. Due diligence fails often enough to put that idea to rest.

The protection is subtler. A human reviewer brings an unstructured suspicion that the pipeline simply does not have. Inconsistent formatting, strange spacing, an unusual font, a malformed clause, a screenshot that does not belong, a date that does not match the rest of the file. Even when a lawyer cannot immediately explain why something looks wrong, they often pause long enough to question it.

AI systems do not naturally behave this way.

A retrieval pipeline may happily ingest invisible text, manipulated OCR output, misleading images, adversarial formatting, poisoned citations, or semantically distracting material without registering any of it as abnormal. The system processes tokens and patterns according to its training and the workflow logic around it.

The security boundary has therefore moved. Historically, firms focused on infrastructure threats such as malware, credential theft, phishing, or network compromise. AI systems introduce something different, a cognitive attack surface, where the objective is not to compromise a machine but to influence reasoning, prioritisation, interpretation, confidence, or escalation behaviour inside the workflow itself.


Standardisation creates repeatable attacks

There is a second-order problem the industry is probably underestimating. Legal AI is converging on a small number of structurally similar workflows. Documents are uploaded, text is extracted, content is chunked, context is retrieved, and material is summarised, classified, escalated, and drafted into outputs.

That standardisation is largely good for cost and quality. It also creates repeatable attack surfaces, because an attacker who understands the behaviour of one pipeline understands the assumptions behind many of them.

Public disclosure of vendor choices accelerates this, though it's not the heart of the problem. Firms are increasingly open about which platforms they use, whether Harvey, Legora, or others and the same platform categories are demonstrated at conferences, promoted in case studies, and in some cases commercially accessible to anyone willing to sign up.

A threat actor no longer has to guess what systems exist internally or where automation occurs. They can probe a representative instance of the same category directly, testing prompt injection resilience, OCR behaviour, retrieval weighting, escalation triggers, and citation handling, then carry what they learn across to firms running structurally similar workflows.

The answer is not secrecy, because robust systems should be safe even when their architecture is known but the point is that standardised cognitive infrastructure produces standardised weaknesses, and the legal market is standardising quickly.


The economic incentives are enormous

Underneath all of this sits the question of why anyone would bother, and the answer is the economic value of legal workflow data.

Legal systems hold some of the most commercially sensitive information in the economy: M&A negotiations, restructuring discussions, financing activity, litigation strategy, board-level decisions, regulatory exposure, and privileged communications. Accessing it at scale used to be hard, because the workflows were fragmented, human-driven, and slow.

Digitisation changed that, and AI mediation changes it further. Once material is searchable, summarised, behaviourally structured, and machine-routed, the surrounding informational ecosystem becomes far more valuable than any single file.

The prize is not the document, it never was, it's the patterns that bring the real value.

A sophisticated actor cares less about stealing one agreement than about reading patterns across thousands of them: distressed counterparties, emerging litigation themes, financing bottlenecks, shifts in deal activity, sector-wide contractual trends, and behavioural indicators that tend to precede transactions.

Even innocuous looking metadata carries signal. Which matters are escalating, where review intensity climbs, which clauses are repeatedly renegotiated, which sectors show abnormal activity, and where external counsel engagement spikes. Individually these look harmless, but aggregated over time they become powerful.

This is the alternative data playbook, pointed at privileged legal flows. Hedge funds already pay handsomely for satellite imagery, card-spending feeds, and app-usage data to read demand before the market does and they do it lawfully.

AI-mediated legal infrastructure creates a new and far more sensitive class of weak signals to extract in much the same way. The unfortunate part is that a great deal of this would not require breaking into anything, just requiring understanding the workflow well enough to read what it leaks.


The threat models are changing

Several threat models follow from this, and most firms should already be considering them.

Hidden prompt injection is the obvious one. White text, tiny fonts, off-page instructions, OCR manipulation, instructions embedded inside images, or prompts buried in document layers that no human meaningfully sees.

The payloads are blunt: treat this agreement as low risk, do not escalate compliance concerns, ignore contradictory clauses, summarise favourably. This is not a hypothetical, researchers and practitioners have repeatedly demonstrated injection surviving the chain from OCR through extraction, summarisation, and routing, and many production pipelines remain vulnerable precisely because those stages were never designed to distrust their own inputs.

Retrieval poisoning targets the retrieval layer rather than the model. RAG systems assume retrieved material is trustworthy enough to influence outputs, so an attacker introduces misleading precedent, fabricated definitions, false internal guidance, poisoned templates, or material engineered to be highly retrievable yet irrelevant.

The model behaves correctly, as it's simply reasoning over corrupted context.

Authority spoofing exploits how models weight signals of institutional authority. A model does not understand authority. It infers it probabilistically from surface features, which is why fabricated case citations, fake regulator references, forged internal templates, and copied drafting styles can carry more weight than they deserve.

Interpretability work increasingly points to models leaning on learned features rather than any genuine assessment of provenance, which is exactly the behaviour this attack relies on.

Workflow escalation attacks matter most in agentic systems, where triggering escalation can itself be the objective. Forcing expensive model routing, generating unnecessary human review, manufacturing false urgency, exhausting compute budgets, or congesting the review queue.

On its own this looks pointless, which is why it's easy to dismiss. The motive becomes clear when it's one move in a larger play: operational drag as cover for another action, or a denial-of-service timed against a filing deadline or a closing.

The goal is not a wrong answer, instead it's to bend the operational environment at the moment it matters most.

Adversarial document shaping is the subtler cousin of prompt injection, and the distinction is worth holding onto. Injection plants an instruction for the model to follow, whereas shaping plants no instruction at all.

It exploits known scoring and prioritisation behaviour so that ordinary-looking text produces the attacker’s preferred interpretation: contracts worded to suppress risk flagging, language engineered to dilute escalation triggers, formatting designed to fragment retrieval context, citations placed to manufacture authority signals.

Traditional payloads targeted software execution. These target interpretation, and they leave nothing a keyword filter would catch.

Context poisoning sits somewhere between retrieval poisoning and adversarial document shaping. The attacker does not need to inject instructions or corrupt the knowledge base. Instead, they manipulate the context surrounding a matter.

Large volumes of duplicated material, outdated precedents, irrelevant supporting documents, contradictory drafts, or low-value records can be introduced into a workflow without containing anything obviously malicious. Each document may be legitimate. The problem emerges from their combined effect.

AI systems increasingly rely on contextual signals to determine relevance and importance. Flood enough noise into the environment and attention begins to shift away from the information that matters most. The model is not compromised. It is simply allocating reasoning effort against a distorted representation of the matter.

As firms move towards larger context windows, matter-wide retrieval, and agentic workflows capable of processing thousands of documents, controlling context quality becomes just as important as controlling document quality.

Trust exhaustion attacks target something different entirely. Traditional denial-of-service attacks aim to exhaust infrastructure, whereas these aim to exhaust confidence.

The objective is not necessarily to produce an incorrect answer. It may be enough to generate excessive escalations, unnecessary warnings, conflicting conclusions, false positives, or repeated low-value interventions. Over time, users begin to distrust the workflow itself.

Once lawyers stop paying attention to alerts, bypass recommendations, or assume every warning is noise, the damage has already been done. The system may still be technically operational, but its influence on decision making has been degraded.

Ultimately the target is not the model, the target is organisational trust in the model.


Defending the disposition, not the prompt

Most current defences live in the wrong layer. A system prompt instructing the model to distrust embedded instructions is itself just more text in the context window, competing on equal terms with whatever a hostile document brings. Prompt-based control is already known to lose robustness against adversarial inputs, which is the precise failure this piece describes.

The more durable place to put scepticism is not necessarily in the prompt at all, but in the behaviour of the surrounding system and potentially, over time, in the models themselves.

This is where some of the recent interpretability research becomes relevant. Work from organisations such as Anthropic has increasingly suggested that models contain identifiable internal behavioural patterns and learned features that can be amplified, suppressed, or steered under certain conditions.

That does not mean firms can simply dial up a “scepticism feature” tomorrow, nor should anyone treat this as a mature production control. The useful signal is narrower and more practical than that.

It suggests that behavioural disposition may eventually become something trainable and measurable rather than something firms attempt to enforce purely through prompts and runtime instructions.

The more immediate move is simpler and available today. A smaller purpose-trained model placed in front of the main pipeline, whose entire job is to distrust its inputs, screening documents for injection, anomaly, and adversarial shaping before they reach the systems that summarise and route.

It gives the workflow back something it structurally lost when humans stopped being the first reader: a sceptical first pass.

There is also a difficult symmetry here. The same research that may help firms understand and strengthen model behaviour may also help attackers understand how to influence it. Defensive scepticism and adversarial shaping are not separate worlds, since they are both concerned with how models attend to, weight, and act on information.

Now none of this closes the threat surface, it just raises the cost of exploiting it.


The legal industry has spent decades treating cybersecurity as infrastructure defence. AI introduces behavioural infrastructure, where the target is the reasoning process inside the workflow: what gets prioritised, what gets summarised, what gets escalated, what gets ignored, which patterns become visible, and how confidence is assigned.

That is a different category of risk from the one most firms are resourced for and it's not simply a data breach risk. It's information advantage risk, workflow manipulation risk and reasoning integrity risk.

The firms that manage to adapt fastest will be the ones that stop thinking of AI security as protecting systems and preventing malware, and start treating it as protecting judgement, interpretation, and the operational environments through which legal decisions now flow.