Fail-Closed AI Agents | Tim Downs Mullen

I let AI agents take real actions on my systems. Not just produce text — edit files, run shell commands, call APIs, write to systems of record. The moment you do that in a regulated domain, the math of a mistake changes. A bad paragraph, you regenerate. An action the operator never approved — a disclosed record, an unauthorized change to a system of record, a file sent to the wrong party — you report. The cost stops being "try again" and becomes a reportable event.

So how does almost everyone govern these agents today? With prose. The system prompt says "always ask before deleting anything." The policy doc says "escalate to a human for anything sensitive." These are real attempts at control, and they're better than nothing. They're also the weakest kind, because they share one fatal property: they fail open.

A control fails open when, the moment it stops working, the dangerous thing happens anyway. Prose governance is exactly this. The instruction is advisory — the model has to choose to honor it on every single turn. Under context pressure, under momentum toward the goal it was pointed at, under an ambiguous request, that instruction is the first thing to slip. And when it slips, nothing catches the action. The agent just does the thing. The guardrail was a suggestion, and suggestions aren't load-bearing.

Engineers have a name for the opposite posture. A fail-closed system defaults to the safe state when something goes wrong: the brake that engages when air pressure is lost, the door that locks when the power dies, the valve that shuts when the signal drops. You don't trust the operator to remember to pull the brake. You build a brake that pulls itself.

Agentic governance has to move the same way: out of the prose and into a structural gate. Stop instructing the agent to ask permission — put a mechanism between its intent and its effect that physically blocks any action outside an approved envelope, before it runs.

The Gate

What I built sits at the tool boundary. Every consequential tool call — every file write, every command — gets intercepted before it executes and checked against an approved plan. The protocol is boring on paper: PLAN, then CONFIRM, then EXECUTE. The agent states, in advance, exactly what it intends to do — which files, which commands. I read it and say yes, no, or modify. The agent may then act on the confirmed items, and only those. An action that isn't in the approved plan does not run.
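To make the PLAN, CONFIRM, EXECUTE cycle concrete, here is a minimal sketch in Python. It is not the actual plan-gate code: the names `Action`, `Plan`, `GateDenied`, and `execute` are hypothetical, and a real gate would hook into whatever tool-call interface your agent framework exposes.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Action:
    """One consequential tool call: a kind ("write" or "run") and its exact target."""
    kind: str
    target: str


@dataclass
class Plan:
    """Hypothetical plan object: what the agent proposed and what the human confirmed."""
    proposed: list[Action]
    confirmed: set[Action] = field(default_factory=set)

    def confirm(self, approved: list[Action]) -> None:
        # Only items the agent actually proposed can be confirmed.
        self.confirmed = {a for a in approved if a in self.proposed}


class GateDenied(Exception):
    """Raised when a tool call is not in the confirmed plan."""


def execute(plan: Plan, action: Action, tool):
    # The check runs here, in ordinary code, before the tool is invoked.
    if action not in plan.confirmed:
        raise GateDenied(f"not in approved plan: {action.kind} {action.target}")
    return tool(action)
```

In this sketch the agent never calls a tool directly. It calls `execute`, and an action that was proposed but not confirmed raises `GateDenied` before the tool function is touched.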

The difference from a careful system prompt — the whole difference — is where the rule lives. In the prose approach, the agent is simultaneously the actor and the thing deciding whether to honor the rule. In the structural approach, the check runs in ordinary code the agent doesn't control. It can't reason its way past it, can't decide the rule doesn't apply this time, can't forget it under momentum. The rule isn't something the agent is asked to follow. It's a condition of the action happening at all. This is the difference between asking a driver to brake and installing a brake that engages itself.

The governing bias is consent over completion. An autonomous system has a natural pull toward finishing the task — that's what it was pointed at, and that pull is exactly the momentum that runs over a guardrail. So the default is inverted: when an action is unapproved, ambiguous, or surprising, the gate denies and stops for a human. An agent that can't get approval does nothing. Doing nothing is the safe state, and the system defaults to it.
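One way to picture the inverted default is a decision function that allows only on an explicit yes. This is a sketch under assumptions: `gate_decision` and the `check` callable are hypothetical names, and the "ambiguous" case is modeled simply as any verdict other than the literal `True`.

```python
from typing import Callable


def gate_decision(check: Callable[[str], object], action: str) -> str:
    """Return "allow" only on an explicit True; everything else is "deny"."""
    try:
        verdict = check(action)
    except Exception:
        # The check itself broke: fall to the safe state.
        return "deny"
    # None, "maybe", 1, or any other non-True value is treated as unapproved.
    return "allow" if verdict is True else "deny"
```

The design choice is that there is no code path where a crash, a timeout wrapped as an exception, or an unclear answer results in the action running. Denial is the fallthrough, not a special case.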

How It Got Strict

I didn't design the strict version up front. Every tightening was scar tissue from a specific failure that slipped through the looser version. Two are worth telling, because they're the ones that convinced me the enforcement was real and not theater.

The first: the gate blocked its own author's agent. Mid-session, my assistant tried to save a routine note to its own memory directory. The gate denied it — no approved plan named that path. Nothing about the write was dangerous; it was the agent's own scratchpad. That is the point. A structural gate doesn't reason about intent or benignity. It blocks anything outside the approved envelope, including my own trusted tooling. Prose governance would have waved a "harmless" write through. The fix was the right one — widen the approved plan to name the path explicitly — not "trust the agent because it meant well."

The second cut the other way. I'd added a tightly-scoped carve-out so the agent could write to one specific memory directory without ceremony. A different memory directory — one character of path apart — wasn't covered, so writes there got denied unexpectedly. The lesson runs both directions: tight scoping keeps the blast radius small, but it produces "why didn't it work here?" surprises. The resolution was to authorize the specific path explicitly — not to broaden the carve-out to "all memory directories." Choosing the narrow, annoying fix over the broad, convenient one is the entire discipline in miniature.
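The narrow carve-out can be sketched as an exact match on the parent directory rather than a prefix or pattern match. The function name `in_carve_out` and the example paths are hypothetical, and the normalization here is lexical only: it does not resolve symlinks, which a production gate would need to consider.

```python
import posixpath
from pathlib import PurePosixPath


def in_carve_out(path: str, approved_dirs: set[str]) -> bool:
    """Allow a write only if its parent directory was named explicitly."""
    # Collapse "..", "." and duplicate slashes so "a/../b" cannot dodge the comparison.
    parent = PurePosixPath(posixpath.normpath(path)).parent
    return str(parent) in {posixpath.normpath(d) for d in approved_dirs}


approved = {"agent/memory"}
print(in_carve_out("agent/memory/note.md", approved))   # covered by the carve-out
print(in_carve_out("agent/memory2/note.md", approved))  # one character apart: denied
```

Because the comparison is on whole directory names, a sibling directory that differs by one character is denied, and so is a nested subdirectory. Covering either one means adding it to the approved set by name.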

What It Catches, and What It Only Sees

I want to be honest about the boundary, because governance is only credible if it survives contact with how agents actually break. Some failures the gate prevents outright: an action outside the approved plan simply does not run, and momentum that would have shipped an unapproved change under prose governance gets stopped at the gate instead.

Others it only detects or mitigates, and pretending otherwise would be the same fail-open dishonesty I'm arguing against. Two examples: a write silently truncated mid-content, and a cached filesystem that serves stale reads, so the agent acts on pre-edit state and nothing signals the problem. The gate can warn about those, but it doesn't prevent them. They live a layer below authorization, in the state-transfer machinery between agent sessions. I wrote about that layer separately (what breaks when you give two AI agents a shared notepad), because those failures have a different shape and a different fix. This piece is about the authorization layer: keeping the agent inside an envelope I approved.

The Compliance Falls Out

Here's the part I didn't expect. I built the gate to keep the agent safe. I did not build an "audit-logging feature." But every gate decision — every allow, every deny — gets recorded with a timestamp, the target, the command, and the plan that governed it. That record is just exhaust. The gate produces it while doing its actual job.

And that exhaust turns out to be most of what a SOC 2 or ISO audit asks you to evidence: who was permitted to touch what, a monitoring trail of policy-violation attempts, a documented approve-act-record cycle for every change, and a human as the sole approver. The same handful of mechanisms answers a security framework, a quality framework, and an AI-management framework at once, because all three are, at bottom, asking the same question: can you show this system only did what it was authorized to do, and prove it after the fact? A fail-closed gate answers yes by construction.

You don't build these controls to pass an audit. You build them to keep the agent safe — and the audit evidence is what they emit while doing it.
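The decision record described in this section (timestamp, target, command, governing plan) can be sketched as one JSON line per gate decision. The field names, the `record_decision` function, and the in-memory `log` list are assumptions for illustration. A real deployment would write to append-only storage.

```python
import json
from datetime import datetime, timezone


def record_decision(log: list[str], decision: str, target: str,
                    command: str, plan_id: str) -> dict:
    """Append one gate decision to the log as a JSON line and return the entry."""
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "decision": decision,   # "allow" or "deny"
        "target": target,       # file or resource the action touched
        "command": command,     # the tool call that was attempted
        "plan_id": plan_id,     # which approved plan governed the decision
    }
    log.append(json.dumps(entry, sort_keys=True))
    return entry
```

Calling this from the same code path that makes the allow or deny decision is what keeps the trail complete: there is no separate logging step that can be skipped.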

What I'd Tell You

If you take one thing from this: don't ask an agent to behave — build a gate it can't work around. Any governance that lives in prose is a request the agent can decline, and under enough pressure it eventually will. The only governance you can rely on runs in code the agent doesn't control and defaults to deny when it's unsure.

Four moves carry almost all the weight:

1. Put the rule at the tool boundary, not in the prompt. A pre-action hook that checks every consequential call against an approved plan is a small amount of code, and it changes your risk posture immediately. Start there.
2. Fail closed. When an action is unapproved, ambiguous, or surprising, block it and stop for a human. Bias the whole system toward consent over completion.
3. Enumerate; never grant broadly. Authorize the exact files and the exact commands, not the directory or the tool. Broad grants are fail-open invitations wearing the costume of convenience.
4. Let the gate be your audit trail. Don't build a separate compliance feature. Log every gate decision and you'll find you've already produced most of what the auditors want.

None of this needs a big team or a greenfield system. The gate I run was extracted from a working product, and its strictest features are scar tissue from real failures — which means you can adopt it incrementally, and let your own failures tell you where to tighten next. I open-sourced it as plan-gate, the flagship module of a suite I call StrictLock.

Build the gate. Let it hold.

Tim Downs Mullen is a systems engineering leader with 25 years in aerospace, defense, and healthcare technology. He builds AI-augmented development workflows for regulated environments where "move fast and break things" isn't an option.