When the AI Said Munch

In early 2025, the US Department of Veterans Affairs came under pressure to slash costs fast, and a new federal unit called the Department of Government Efficiency (DOGE) was set up to take charge. One of its first experiments was an internal AI system built to identify contracts that didn’t "directly support patient care."
The tool itself wasn’t given a public name, so I’ll just refer to it here as Munch, because that’s exactly what it was designed to do: chew through contract records and spit out a list of what it thought should be cut.
Munch was built quickly, too quickly. It was handed a vague prompt, given only the first 10,000 characters of each contract, and then dropped into a system it didn’t understand. Within days, it had flagged dozens of contracts for cancellation. Some were clearly critical: cancer gene sequencing, surgical equipment, nurse hiring platforms. Others were misread entirely or hallucinated from thin air. One contract was flagged at $12 million, a figure Munch simply made up.
Staff on the receiving end were told to justify exceptions; in some cases they had hours to respond, and the justification was capped at 255 characters. Now Munch didn’t cancel contracts itself, but in practice it (along with the whole culture around it) created a pressure funnel. Push back too slowly, or with too little detail, and the contract was gone.
This Wasn't a Model Problem
Munch did what it was told. But it wasn’t trained on federal procurement rules or healthcare delivery systems, and it didn’t understand what "support" really meant in that context. It returned outputs that looked confident, and the system treated that confidence like evidence.
The real failure was structural: no proper review paths, no safeguards, and seemingly no one asking whether the tool could do the job before it started flagging cancer diagnostics as waste.
It wasn’t a rogue system, but it was a rushed deployment with too much trust and not enough design, much like a well-meaning parishioner with a paintbrush deciding to restore a crumbling fresco.
What Actually Went Wrong
1. You Can’t Prompt Around Complexity
The instruction was simple: keep only the contracts that directly support patient care. On the surface, that sounds clear. It isn’t.
Support in a healthcare system includes cleaning services, diagnostics infrastructure, HR platforms, and procurement tools. None of these look clinical, but all are essential. Munch had no understanding of that nuance. It guessed, and it did so with confidence.
This ties back to something I’ve covered before in the context of prompt design. If you don’t structure inputs with legal or operational clarity, you just end up with surface-level output that sounds right.
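To make that concrete, here is a rough sketch of what a more structured version of that instruction could look like, with explicit categories, an abstention option, and a rule against inventing figures. The category names and criteria are mine, for illustration only, not anything the VA actually used.

```python
# A minimal sketch of a structured classification prompt.
# Categories and rules are illustrative assumptions, not the real system.

REVIEW_PROMPT = """You are reviewing a federal contract record.

Classify it into exactly one category:
- DIRECT_CARE: clinical services delivered to patients (e.g. surgery,
  diagnostics, nursing).
- CARE_ENABLING: services patient care depends on, even if non-clinical
  (e.g. sterilisation, lab infrastructure, clinical staffing platforms).
- ADMINISTRATIVE: back-office services with no clear link to care delivery.
- INSUFFICIENT_INFORMATION: the record does not contain enough detail
  to classify confidently.

Rules:
- Quote the exact text you relied on.
- Do not estimate or invent contract values.
- If the record appears truncated, say so and use INSUFFICIENT_INFORMATION.

Contract record:
{contract_text}
"""

def build_prompt(contract_text: str) -> str:
    """Fill the template; truncation is surfaced to the model, not hidden."""
    return REVIEW_PROMPT.format(contract_text=contract_text)
```

Even a prompt like this won’t teach a model procurement law, but it forces the ambiguity into the open instead of letting the model resolve it silently.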
2. AI Doesn’t Get to Decide
Munch didn’t cancel anything directly. It generated a list, which landed on someone’s desk with a short deadline and barely enough room to explain why the AI was wrong. Many didn’t have time. Some didn’t push back. Others didn’t even see the flag before the contract was marked for termination.
That dynamic feels familiar. Legal teams face growing demand with fewer people and tighter deadlines. AI hasn’t eased the pressure. It has raised the expectation. If the model returns a clean output, someone assumes that’s good enough.
This shift doesn’t always happen formally. It creeps in through tools that look finished. I’ve written before about how outputs can quietly become decisions, especially when review feels harder than acceptance.
3. Confidence Isn’t Competence
Munch flagged contracts it barely read. It hallucinated values and misunderstood the services being described. None of that showed in the output. Everything came back polished, structured, and sure of itself.
Language models speak with the same tone whether they’re right or wrong. They mirror form without necessarily grasping function. I saw this often when testing clause extraction during due diligence. Models would replicate the shape of an indemnity or warranty clause without picking up the actual substance.
That’s the problem. It’s not just that the answer is wrong. It’s that it looks completely fine until someone relies on it.
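One cheap guard against that failure mode, sketched below rather than prescribed, is to check that any dollar figure in the model’s output actually appears in the text it was given. It wouldn’t have fixed Munch, but it would have caught an invented $12 million.

```python
import re

def values_are_grounded(model_output: str, source_text: str) -> bool:
    """Return False if the output cites a dollar figure that never appears
    verbatim in the source document, a common hallucination pattern.
    Exact string matching is crude, but it catches invented numbers."""
    cited = re.findall(
        r"\$[\d,.]+(?:\s*(?:thousand|million|billion))?",
        model_output,
        flags=re.IGNORECASE,
    )
    return all(amount.lower() in source_text.lower() for amount in cited)
```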
4. Governance Isn’t an Add-On
You can’t build trust by layering on audit logs at the end. If the tool is involved in decision-making, governance has to be part of the foundation.
- Track prompts.
- Log what the model saw.
- Capture what changed, who approved it, and when.
If your tool can’t answer those questions, it doesn’t belong anywhere near legal or operational workflows.
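As a sketch of the minimum record I’d want behind each flag (the field names are my own, not taken from any particular system), something like this can answer all three questions:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class ReviewAuditRecord:
    """One auditable row per model flag: what the model saw, what it said,
    and what a human did with it. Field names are illustrative."""
    contract_id: str
    prompt_version: str           # which prompt template produced this flag
    input_excerpt_hash: str       # hash of the exact text the model was shown
    model_output: str             # the raw flag or classification
    reviewer: str | None = None   # who accepted, rejected, or escalated it
    decision: str | None = None   # "accepted" / "rejected" / "escalated"
    decided_at: datetime | None = None
    created_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )
```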
This ties back to a thread I’ve returned to across multiple projects. If a system can’t show its working, it isn’t automating decisions. It’s just hiding where they came from.
Governance isn’t just boring admin; it’s how you stop mistakes from becoming process.
5. Sometimes "I Don’t Know" Is the Right Answer
Munch never said "not sure." It returned something every time, even when it was clearly guessing. That’s a flaw in the system, not just the model.
In legal work, "not enough information" is a perfectly valid output. That needs to be part of the design. Forcing a model to pick something guarantees you’ll get made-up answers with no signal that they’re unreliable.
This has come up repeatedly in my own evaluations. If the model isn’t allowed to express doubt, it will keep talking anyway.
You don’t need a model that answers everything. You need one that knows when to stop.
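In practice that means giving the model an explicit abstention label and refusing to accept outputs that skip it. A minimal sketch, assuming a JSON-style response and made-up label names:

```python
# Treat "not enough information" as a first-class output.
# Allowed labels and the evidence rule are assumptions, not any real system.

ALLOWED_LABELS = {"keep", "flag_for_review", "insufficient_information"}

def parse_review(raw: dict) -> dict:
    """Reject outputs that dodge the abstention path or skip evidence."""
    label = raw.get("label")
    if label not in ALLOWED_LABELS:
        raise ValueError(f"Unexpected label: {label!r}")
    # A flag with no quoted evidence is downgraded to an abstention,
    # not treated as a decision someone has to argue against on a deadline.
    if label == "flag_for_review" and not raw.get("evidence"):
        return {"label": "insufficient_information", "evidence": None}
    return {"label": label, "evidence": raw.get("evidence")}
```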
6. No Escalation Path Means Quiet Failures
When a contract was wrongly flagged, the response process left no room to challenge it properly. No clear escalation route. No second layer of review. No way to push back except in a tiny comment box, on a tight deadline.
This builds on earlier work I’ve done around legal workflows and what I call intentional handoff. Escalation has to be visible, deliberate, and easy to trigger. Otherwise, judgment calls vanish into automation.
Good tools create space for disagreement. Great ones track how those disagreements change outcomes.
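One way to make that concrete is to model escalation as explicit states a flag has to move through, so a disputed flag can’t quietly become a cancellation. The states and transitions below are illustrative, not drawn from any real workflow:

```python
from enum import Enum

class FlagStatus(Enum):
    FLAGGED = "flagged"
    DISPUTED = "disputed"      # reviewer pushed back, with reasons on record
    ESCALATED = "escalated"    # sent to a second, named reviewer
    UPHELD = "upheld"
    OVERTURNED = "overturned"

# A flag can only be closed out through a visible, recorded path.
ALLOWED_TRANSITIONS = {
    FlagStatus.FLAGGED: {FlagStatus.DISPUTED, FlagStatus.UPHELD},
    FlagStatus.DISPUTED: {FlagStatus.ESCALATED, FlagStatus.OVERTURNED},
    FlagStatus.ESCALATED: {FlagStatus.UPHELD, FlagStatus.OVERTURNED},
}

def advance(current: FlagStatus, target: FlagStatus) -> FlagStatus:
    """Refuse silent jumps, e.g. straight from flagged to upheld
    while a dispute is still open."""
    if target not in ALLOWED_TRANSITIONS.get(current, set()):
        raise ValueError(f"Cannot move from {current.value} to {target.value}")
    return target
```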
This Isn’t a One-Off
Munch made headlines because of what it impacted: veterans, healthcare, federal budgets. The design failure wasn’t rare. You can already see the same shape across legal tech.
Tools that suggest deletions, flag clauses, or score risk are often built on vague prompts, tight context windows, and outputs that sound more certain than they should. Evaluation is an afterthought. The result looks sharp, the process moves quickly, and people start trusting it too soon.
This isn’t hypothetical. I’ve seen it again and again when building and testing prompt-led review systems. If you don’t define the task properly, set the right boundaries, or plan for when the model doesn’t know, you’re not building a legal tool. You’re building risk dressed up as efficiency.
It’s easy to create something that sounds confident. Much harder to build something that holds up when the input’s messy or the stakes are high, which is where the real work is.