Where Is the Big Red Button? Why Every Law Firm Needs an AI Incident Plan

Over the past two years, law firms have invested significant time and effort into AI governance. Risk teams have reviewed security controls, procurement functions have assessed vendors, and technology teams have developed implementation roadmaps. Committees have been formed, policies have been drafted, and organisations have worked hard to establish frameworks for responsible adoption.

All of this is sensible.

What is odd, however, is that most of these activities focus on a single point in time: the decision to deploy.

The legal sector has become increasingly sophisticated at asking whether a system is safe enough to introduce into the organisation. Much less attention is given to a different question: how do we know the system is still operating as expected six months later?

That may sound like a subtle distinction, but it is becoming increasingly important as AI moves from isolated experiments into core business processes. A system that performs well during testing may behave differently over time as prompts evolve, source data changes, retrieval pipelines are modified, models are updated and user behaviour shifts. None of these changes necessarily creates an obvious failure. In many cases the platform remains available, users continue interacting with it and work continues flowing through the system.

The challenge is that quality, predictability and reliability can change long before anybody notices.


Trust Requires Verification

A friend who worked as an engineer for a major betting company once described how critical event feeds were monitored within their platform.

The system expected a continuous stream of heartbeat messages confirming that external data sources were alive and functioning correctly. If those heartbeats stopped arriving for more than a few seconds, betting could be automatically suspended until the issue was understood. The logic was straightforward. If visibility into the underlying event had been lost, continuing to accept bets created unnecessary risk.
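The pattern described here can be sketched in a few lines of Python. This is a minimal illustration rather than the betting company's actual implementation: the class name `HeartbeatMonitor`, the five-second timeout and the explicit `now` timestamps are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class HeartbeatMonitor:
    """Tracks the last heartbeat and reports when activity should be suspended."""

    timeout_seconds: float  # assumed tolerance; the source only says "a few seconds"
    last_beat: Optional[float] = None

    def beat(self, now: float) -> None:
        # Record the time at which the latest heartbeat arrived.
        self.last_beat = now

    def should_suspend(self, now: float) -> bool:
        # No heartbeat ever seen, or the last one is too old: fail safe.
        if self.last_beat is None:
            return True
        return (now - self.last_beat) > self.timeout_seconds


monitor = HeartbeatMonitor(timeout_seconds=5.0)
monitor.beat(now=100.0)
print(monitor.should_suspend(now=103.0))  # False
print(monitor.should_suspend(now=106.0))  # True
```

The design choice worth noticing is the default: when there is no evidence of health, the monitor reports that activity should stop rather than assuming all is well.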

What interested me was not the technical implementation, fascinating though it was, but the mindset behind it.

The organisation did not assume that because a feed was functioning earlier in the day it would continue functioning indefinitely. The feed was expected to continually prove that it remained healthy and operating within acceptable parameters, and when that assurance disappeared, protective controls took over.

This approach is common across industries that operate critical systems. Financial institutions monitor transaction anomalies. Cloud providers monitor infrastructure health. Telecommunications companies monitor network performance. The assumption is rarely that systems will never fail. It is that failures, degradation and unexpected behaviour will occur, and organisations need mechanisms to detect them quickly.

Legal AI is increasingly reaching a similar point. As these systems become embedded within document review, knowledge management, matter triage, research and drafting workflows, firms need confidence not only that the technology worked during testing but that it continues to operate acceptably in day-to-day practice.


Vendor Assurance And Organisational Assurance Are Different Things

Much of the current conversation around AI assurance focuses on supplier due diligence.

Questions around security certifications, data residency, privacy controls, contractual protections and governance processes are important and should continue to be asked. Many legal AI vendors are investing heavily in evaluation, monitoring, citation validation and quality assurance because they understand that trust is essential for adoption within the legal sector.

The challenge is that vendor assurance and organisational assurance answer different questions.

A vendor may be able to demonstrate that a platform is operating as intended across its customer base. A law firm still needs to determine whether that platform continues to operate acceptably within its own matters, workflows, clients, risk tolerances and professional obligations.

Years ago I worked on an integration where our error handling logic relied on a specific error structure being returned by a third-party provider. At some point the provider changed that structure. The service itself continued operating, but our error handling no longer worked as intended. Nobody involved was negligent, and I'm sure the provider had valid reasons for making the change, but the integration had been built around assumptions that no longer held true.

It reinforced the importance of having your own monitoring in place rather than relying solely on the assumption that everything upstream would remain unchanged.
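One way to guard against this kind of silent drift is to check that an upstream response still has the shape the integration was built around, and to flag anything unfamiliar instead of ignoring it. The sketch below is illustrative only: the payload layout, the `EXPECTED_ERROR_KEYS` set and the `classify_error` function are hypothetical and do not describe the provider in the anecdote.

```python
from typing import Any, Dict

# Hypothetical error shape this integration was built around.
EXPECTED_ERROR_KEYS = {"code", "message"}


def classify_error(payload: Dict[str, Any]) -> str:
    """Return 'known' if the payload matches the expected shape, else 'unrecognised'."""
    error = payload.get("error")
    if isinstance(error, dict) and EXPECTED_ERROR_KEYS.issubset(error.keys()):
        return "known"
    # Shape has drifted: surface it for a human rather than assuming success.
    return "unrecognised"


print(classify_error({"error": {"code": 429, "message": "rate limited"}}))  # known
print(classify_error({"errors": [{"detail": "rate limited"}]}))  # unrecognised
```

Counting how often the "unrecognised" branch is hit gives an early signal that an upstream assumption has stopped holding.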

AI systems create similar challenges. A retrieval pipeline may begin surfacing less relevant material following changes to source content. A prompt modification may alter behaviour in subtle ways that are difficult for users to detect. A model provider may release a new version that performs differently on a firm’s specific legal tasks. A classification workflow may gradually become less effective as document types evolve.

None of these scenarios necessarily represents a vendor failure. The platform remains available, the security controls continue functioning and service level agreements continue to be met, yet the quality of legal work being produced may be changing.


Governance Without Visibility Is Incomplete

Governance frameworks define expectations. Operational visibility determines whether those expectations are actually being met.

Many organisations can explain why a system was approved, who signed off the procurement process and which committee oversees its use. Far fewer can explain how they would detect a meaningful deterioration in output quality over the next six months.

A policy document cannot tell you that retrieval quality has declined and a governance committee cannot identify that a model update has introduced new failure patterns into a contract review workflow. An acceptable use policy cannot detect that a knowledge repository is becoming increasingly outdated or that users have begun developing workarounds that bypass intended controls.

All of these require measurement and, more importantly, measurement that continues long after deployment.

The legal sector has historically been comfortable with periodic reviews because most traditional software behaves in a relatively predictable manner. AI systems change in ways that are often gradual rather than dramatic. Small adjustments can accumulate over time without triggering obvious alarms. Outputs may remain plausible even as consistency or quality begins to drift. Users often adapt to these changes without immediately recognising them. That makes ongoing observation just as important as initial approval.

The environment surrounding the system is constantly changing, even when the application itself appears largely unchanged.

A system that passed testing six months ago may still be functioning exactly as designed while no longer delivering the same level of performance.

The organisations that build lasting confidence in AI will therefore need to move beyond static governance artefacts and develop operational capabilities that provide continuous assurance.


Every AI System Needs A Heartbeat

The concept of a heartbeat is useful because it forces organisations to think carefully about what healthy operation actually looks like.

For legal AI, that heartbeat is unlikely to consist of a single metric. Instead, it will be made up of multiple signals that together provide confidence that the system remains within acceptable operating boundaries.

Those signals may include benchmark performance against representative legal tasks, citation validation rates, retrieval quality measures, escalation frequencies, user override patterns, latency trends, cost anomalies, prompt injection attempts and other indicators relevant to the specific workflow.
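As a rough sketch of how several signals might be combined into one health check, the Python below compares the latest readings against agreed ranges. The signal names, the bounds and the `out_of_bounds` function are invented for illustration; each firm would need to set its own measures and thresholds for each workflow.

```python
from typing import Dict, List, Tuple

# Hypothetical acceptable ranges (lower bound, upper bound) per signal.
BOUNDS: Dict[str, Tuple[float, float]] = {
    "benchmark_accuracy": (0.90, 1.00),
    "citation_valid_rate": (0.98, 1.00),
    "user_override_rate": (0.00, 0.15),
    "p95_latency_seconds": (0.0, 8.0),
}


def out_of_bounds(readings: Dict[str, float]) -> List[str]:
    """Return the signals that are missing or outside their acceptable range."""
    breaches = []
    for name, (low, high) in BOUNDS.items():
        value = readings.get(name)
        # A missing reading counts as a breach: no evidence is not good evidence.
        if value is None or not (low <= value <= high):
            breaches.append(name)
    return breaches


readings = {
    "benchmark_accuracy": 0.93,
    "citation_valid_rate": 0.95,
    "user_override_rate": 0.08,
}
print(out_of_bounds(readings))  # ['citation_valid_rate', 'p95_latency_seconds']
```

In this example the citation rate has slipped below its floor and the latency reading is absent, so both are reported even though the system would still appear to be "up".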

The precise measures will vary between organisations and use cases, but the principle remains the same.

The firm should be able to answer a simple question at any point in time: how do we know this system is still performing as expected?

Many firms can answer that question during procurement. Can they answer it a year after deployment?

Continuous evaluation becomes particularly important because technical health and business suitability are not the same thing. A platform can be fully available, secure and functioning correctly from the vendor’s perspective while simultaneously drifting away from the standards required by the organisation using it.

That is why evaluation is gradually becoming an operational control rather than simply a testing exercise.


Agentic AI Changes The Conversation

The common response to discussions around AI controls is that lawyers remain responsible for the work.

That is true, but it does not fully address the operational challenge that increasingly autonomous systems create.

In a traditional drafting workflow, a lawyer may review every output before it is relied upon. In an increasingly agentic environment, systems may classify documents, create tasks, retrieve information, update records, route work and interact with other systems before a human ever reviews the final outcome.

The lawyer remains accountable for the matter, but that does not automatically make them the monitoring function, the intervention function or the operational control function.

These are distinct responsibilities.

As organisations adopt more autonomous systems, they will need to think carefully about who observes system behaviour, who receives alerts when performance changes, who has authority to intervene and who ultimately remains accountable for outcomes. Treating these as the same role may prove increasingly difficult.

This is where discussions around human oversight often become imprecise. Accountability matters enormously, but accountability alone does not tell an organisation how issues are detected, escalated or resolved. A partner may remain accountable for a matter while having little visibility into emerging patterns across thousands of system interactions occurring elsewhere in the organisation.


Where Is The Kill Switch?

The phrase “kill switch” is perhaps slightly misleading because most mature systems are not designed around immediate shutdown. They are designed around controlled degradation.

If confidence in a system begins to decline, organisations may wish to disable autonomous actions while retaining recommendations. They may require additional human approvals. They may switch to fallback models. They may suspend specific workflows while allowing others to continue operating.
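Controlled degradation can be expressed as a small set of operating modes and a rule for moving between them. The sketch below is one illustrative policy under simple assumptions: the `Mode` names and the rule of stepping down one level per breached signal are hypothetical, and a real policy would likely weigh signals by severity.

```python
from enum import Enum
from typing import List


class Mode(Enum):
    AUTONOMOUS = "autonomous actions enabled"
    APPROVAL_REQUIRED = "actions need human approval"
    RECOMMEND_ONLY = "recommendations only, no actions"
    SUSPENDED = "workflow suspended"


def select_mode(breaches: List[str]) -> Mode:
    """Map the number of breached signals to an operating mode (illustrative policy)."""
    if not breaches:
        return Mode.AUTONOMOUS
    if len(breaches) == 1:
        return Mode.APPROVAL_REQUIRED
    if len(breaches) == 2:
        return Mode.RECOMMEND_ONLY
    return Mode.SUSPENDED


print(select_mode([]).name)  # AUTONOMOUS
print(select_mode(["citation_valid_rate", "user_override_rate"]).name)  # RECOMMEND_ONLY
```

The point of writing the modes down in advance is that the response to declining confidence is decided before an incident, not improvised during one.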

The important question is not whether there is a giant red button sitting somewhere within the organisation.

The important question is whether the organisation has already decided what happens when trust in the system begins to decline.

  • What signals trigger intervention?
  • Who receives those signals?
  • What authority do they have?
  • How quickly can action be taken?
  • What happens if costs suddenly spike due to an agent entering an unexpected loop?
  • What happens if evaluation scores fall below agreed thresholds?
  • What happens if a model provider changes behaviour in a way that materially affects a legal workflow?

These are operational resilience questions, and they deserve the same attention that organisations already devote to cybersecurity, disaster recovery and business continuity planning.
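Taking the question about an agent entering an unexpected loop as an example, one possible control is a hard ceiling on steps and spend for each run. The sketch below is a minimal illustration; `RunBudget`, `BudgetExceeded` and the limits shown are hypothetical names and values, not features of any particular agent framework.

```python
class BudgetExceeded(Exception):
    """Raised when an agent run passes its step or spend ceiling."""


class RunBudget:
    def __init__(self, max_steps: int, max_cost: float) -> None:
        self.max_steps = max_steps
        self.max_cost = max_cost
        self.steps = 0
        self.cost = 0.0

    def charge(self, step_cost: float) -> None:
        # Record one agent step and stop the run if either ceiling is passed.
        self.steps += 1
        self.cost += step_cost
        if self.steps > self.max_steps or self.cost > self.max_cost:
            raise BudgetExceeded(
                f"halted after {self.steps} steps, cost {self.cost:.2f}"
            )


budget = RunBudget(max_steps=3, max_cost=1.00)
try:
    for _ in range(10):  # stands in for an agent stuck in a loop
        budget.charge(step_cost=0.20)
except BudgetExceeded as exc:
    print(exc)  # halted after 4 steps, cost 0.80
```

A guard like this answers only one of the questions above. Who is alerted when it fires, and what authority they have, still needs to be agreed separately.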



You Can Outsource Technology. You Cannot Outsource Assurance

One misconception that occasionally appears in AI discussions is the idea that assurance can be outsourced alongside the technology itself.

The reality is more complicated.

Suppliers should provide monitoring, evaluation, reporting and operational controls. Customers should expect them to do so. Many legal AI providers are already making significant investments in these areas because they understand the importance of trust and reliability within legal practice.

Yet organisations deploying these systems still need independent assurance that the technology continues to meet their own requirements.

This principle is hardly unique to AI. Organisations do not abandon security monitoring because cloud providers have security teams. They do not abandon business continuity planning because suppliers have disaster recovery capabilities. They do not abandon financial controls because software vendors provide audit logs.

AI should be viewed through the same lens.

This is also broadly consistent with the direction of travel in regulation. Responsibilities are increasingly distributed across providers, deployers and users. The existence of supplier controls does not remove the need for independent organisational controls because the organisation using the technology remains responsible for understanding whether it continues to operate acceptably within its own environment.

Ultimately, only the firm can determine whether a system remains suitable for its own clients, matters, obligations and risk appetite.


The legal sector often discusses trust as though it is established during procurement. In reality, trust is an operational property.

It emerges from continuous evidence that a system remains accurate, predictable and under control. It depends upon visibility into performance, confidence in monitoring processes, meaningful evaluation against real-world tasks and the ability to intervene when behaviour changes.

The next phase of AI governance is therefore unlikely to be defined by just another policy document or another supplier questionnaire. It will be defined by operational disciplines that allow organisations to understand what their systems are doing, identify when behaviour changes and respond before those changes become client problems.

Every law firm has a cyber incident plan, and as AI becomes embedded within legal operations, every law firm will need an AI incident plan as well.

The question is whether those plans are developed before the first significant incident occurs or after the fan has sprayed all the walls...