Computer Use Agents Are A Bridge, Not The Architecture

A recent benchmark from Reflex compared browser-based computer-use agents against structured API access for the same operational task. The difference was substantial. The AI-driven computer-use path required dramatically more steps, tokens, latency and cost, while also introducing reliability failures caused by interface visibility and navigation state. The API-based path, unsurprisingly, was faster, cheaper and far more deterministic.

The current excitement around computer-use agents is understandable. Instead of rebuilding infrastructure upfront, the model simply uses the same systems your teams already use. It opens applications, moves between tabs, clicks buttons and navigates workflows directly through the interface.

That flexibility is genuinely useful in fragmented enterprise environments.

At some point, though, repeatedly asking a model to visually navigate the same workflow starts to resemble asking AI to generate the same dashboard from the same Excel document from scratch every Monday morning.

Initially that flexibility feels efficient and fun. Over time, though, the stable parts of the workflow should be extracted into structured systems, APIs and explicit logic, because that is cheaper, faster and easier to govern.

That does not make the AI useless, but it changes where the AI sits in the stack.

Human interfaces were built for humans

Most legal systems were designed around a person sitting in front of a screen.

Tabs exist because people need visual separation. Scrollbars exist because screens cannot show everything simultaneously. Hidden menus, expandable sections and modal workflows exist because interfaces are designed around human navigation patterns.

Staff already know where the awkward parts of the workflow are, so a lot of the friction becomes normalised over time.

AI systems do not experience interfaces that way.

A computer-use agent is reconstructing state from screenshots, OCR, DOM fragments, cursor position and partial context windows. It is repeatedly trying to determine:

  • what changed since the last screen
  • whether the page finished loading
  • which elements are interactive
  • whether information is hidden
  • whether an action succeeded
  • whether the visible state reflects the actual system state

Most of the work there is navigation and state reconstruction, not legal reasoning.
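
To make that overhead concrete, here is a minimal sketch in Python of a single agent step. Every helper is a hypothetical stub (written so the sketch runs end to end); a real agent would wire these to a browser driver, an OCR pass and a model call.

def capture_viewport() -> str:
    # Stub: in reality, a screenshot of whatever happens to be visible.
    return "Matter MAT-2041 | Status: Awaiting Partner Review"

def llm_decide(goal: str, visible_text: str, changed: bool) -> str:
    # Stub: in reality, a model call that spends tokens on navigation,
    # not on legal reasoning.
    return "scroll_down"

def agent_step(goal: str, previous_text: str) -> tuple[str, str]:
    visible_text = capture_viewport()        # reconstruct state from pixels
    if "Loading" in visible_text:            # did the page finish loading?
        return previous_text, "wait"         # unknown; try again later
    changed = visible_text != previous_text  # what changed since last screen?
    action = llm_decide(goal, visible_text, changed)
    return visible_text, action              # carry state forward by hand

state, action = agent_step("find pending reviews", previous_text="")
print(action)  # the same loop repeats for every click, scroll and tab switch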

The Reflex benchmark included a simple but important failure mode: the agent missed pending reviews because they were below the fold and not visible in the captured viewport. That kind of issue is not unusual; it is a direct consequence of relying on visual interfaces as the coordination layer.

You might think the answer is simply better agents

A reasonable response is to assume the models just are not capable enough yet.

Perhaps future systems will scroll more intelligently, track state more accurately, maintain stronger memory or recover from interface ambiguity more effectively. Some of that will happen, but even if computer-use agents improve dramatically, the economics still point in the same direction.

An AI navigating a graphical interface is performing a large amount of compensating work that would not exist in a properly structured environment. The model is repeatedly interpreting pixels to rediscover state that the underlying system already knows internally.

The AI ends up compensating for process and system problems that already existed. Structured APIs remove most of that ambiguity entirely.

The question is not whether AI can eventually use interfaces competently. The question is whether interfaces are the right abstraction layer for machine coordination in the first place.

Law firms already coordinate work across disconnected systems:

  • DMS platforms
  • email
  • matter systems
  • billing systems
  • task trackers
  • spreadsheets
  • regulatory portals
  • client extranets
  • Word documents
  • PDFs
  • photos of documents on a screen...
  • approval workflows

A surprising amount of legal coordination work exists purely because information and responsibility are fragmented across those environments.

Adding AI that visually navigates those same fragmented interfaces does not remove the problem. In many cases it amplifies it.

The agent inherits every inconsistency already present in the operational layer:

  • permissions that differ between systems
  • inconsistent matter naming
  • hidden workflow assumptions
  • missing metadata
  • unclear ownership
  • duplicated documents
  • untracked approval state

This is why many AI demos look convincing at small scale and fragile at operational scale.

Structured systems change the economics completely

The more important insight from the Reflex benchmark is not really about tokens or inference cost. It is about representation.

When work is exposed through APIs and explicit workflow state, the AI no longer needs to infer reality from pixels.

Instead of visually inspecting a matter dashboard, the system can query:

{
  "matter_id": "MAT-2041",
  "status": "Awaiting Partner Review",
  "pending_reviews": 3,
  "risk_level": "High",
  "client_visibility": true,
  "execution_gate": "Approval Required"
}

That shift changes almost everything. Latency drops because the model is no longer re-parsing interfaces. Cost drops because the context becomes compact and machine-readable.

Reliability improves because hidden workflow information becomes directly accessible. Auditability improves because actions can be tied to workflow events rather than reconstructed from screenshots and prompt logs.

Most importantly, governance becomes enforceable.
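
As a minimal sketch of what that means in practice, assuming the field names from the JSON above and an illustrative gate rule, a policy check becomes a hard assertion over explicit state rather than an inference from a screenshot:

matter = {
    "matter_id": "MAT-2041",
    "status": "Awaiting Partner Review",
    "pending_reviews": 3,
    "risk_level": "High",
    "client_visibility": True,
    "execution_gate": "Approval Required",
}

def can_execute(matter: dict) -> bool:
    # A hard check on explicit workflow state. The "Cleared" gate value
    # is an illustrative assumption, not a real platform's vocabulary.
    return matter["execution_gate"] == "Cleared" and matter["pending_reviews"] == 0

if can_execute(matter):
    print(f"executing action on {matter['matter_id']}")
else:
    print(f"blocked: gate={matter['execution_gate']!r}, "
          f"pending_reviews={matter['pending_reviews']}")

A check like that can be logged, versioned and tested like any other piece of software, which is exactly what makes it governable.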

This is where orchestration actually matters

A lot of legal AI products still position orchestration as "which model should handle this task?"

That framing is far too narrow, because real orchestration is about coordinating:

  • workflow visibility
  • permissions
  • routing
  • review requirements
  • execution gates
  • evidence capture
  • cost controls
  • escalation logic
  • matter context
  • external system access
  • policy enforcement

The model becomes one component inside a wider operational system.
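
As a rough sketch of that framing, with every name and rule here an illustrative assumption rather than any real product's API, orchestration wraps the model call in permission checks, escalation logic and evidence capture:

from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class AuditEvent:
    actor: str
    action: str
    matter_id: str
    at: str = field(default_factory=lambda: datetime.now(timezone.utc).isoformat())

audit_log: list[AuditEvent] = []

def orchestrate(task: str, matter: dict, permissions: set[str]) -> str:
    if "draft" not in permissions:                  # permissions
        return "denied: caller may not draft on this matter"
    if matter["risk_level"] == "High":              # escalation logic
        audit_log.append(AuditEvent("system", "escalated_to_partner", matter["matter_id"]))
        return "routed: partner review required before execution"
    draft = f"[model output for: {task}]"           # the model is one component
    audit_log.append(AuditEvent("model", "draft_created", matter["matter_id"]))  # evidence capture
    return draft

print(orchestrate("summarise clause positions",
                  {"matter_id": "MAT-2041", "risk_level": "High"},
                  {"draft"}))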

A legal AI platform should not behave like a human clicking around software. It should behave like infrastructure coordinating legal work intentionally. That distinction is important because the risk profile changes completely once outputs become operational.

Generating a summary is relatively low risk.

Triggering a filing, updating a clause position, notifying a regulator or sending advice to a client are operational acts. Those require reliable workflow context, policy checks and clear accountability.

That becomes difficult when the system itself only understands the environment through screenshots and cursor movements.

Computer use still matters, but mostly as a bridge

This does not mean computer-use agents are useless. They are extremely valuable where structured access does not exist (and maybe never will).

Many firms still depend on legacy systems with no APIs, external regulatory portals, client-owned platforms and operational processes that cannot realistically be rebuilt overnight.

In those environments, computer use acts as a compatibility layer: it allows automation to interact with systems that were never designed for machine coordination.

That is super useful, but once a workflow stabilises, firms tend to stop wanting the AI to rediscover the same process through screenshots every day.

The stable parts usually move toward APIs, direct integrations and explicit coordination layers because the operational overhead becomes too obvious to ignore.

AI agents are pushing firms back toward engineering discipline

One of the more interesting ideas is that AI agents are indirectly pushing organisations back towards disciplined software engineering practices.

You can already see why.

If systems expose:

  • structured APIs
  • accessible interfaces
  • deterministic workflows
  • strong metadata
  • clear permissions
  • documented operations
  • reliable workflow state

AI systems become dramatically cheaper, faster and more reliable to operate.

For years, many enterprise systems survived because people compensated operationally for inconsistent design and fragmented workflows. AI systems are far less forgiving (on your wallet and your governance).

Without the structure underneath, the AI layer becomes difficult to trust operationally.