Case Study — AI Workflow Architecture

What Breaks When You Give Two AI Agents a Shared Notepad

How I built (and rebuilt) the protocol layer between a planning agent and an execution agent — and why the handoff architecture is the architecture.

I run two AI agents as a solo developer. One plans. One executes. I sit in between and approve things before they touch my codebase. It sounds clean on paper. It was clean — for about three sessions. Then everything started rotting.

Not the code. The code was fine. What rotted was the space between the agents: the handoff files, the state tracking, the shared context that was supposed to carry meaning from one session to the next. Within two weeks, my carefully structured workflow had degraded into something I started calling "the junk drawer" — a single Markdown file stuffed with completed tasks that nobody cleared, decisions that contradicted each other, and status updates from three sessions ago that my executor agent was confidently acting on as if they were current.

This is the story of how I diagnosed those failures, researched the landscape of multi-agent state management, and arrived at a handoff architecture that actually survives contact with reality. If you're building multi-agent systems — whether for development workflows, business automation, or anything involving persistent state across sessions — the failure modes I hit are coming for you too.

What Broke, and Why

Three failure modes surfaced within the first dozen sessions. Each one looked like a different problem. They weren't.

Context rot is the most fundamental. It's a property of how transformer attention works: as the context window fills with conversation history, the model's ability to recall constraints from earlier in the session degrades. Research from Stanford demonstrated that accuracy can drop from 75% to 55% when critical information is buried in the middle of a moderate context window. In practice, this meant my planning agent would forget architectural decisions it had made two sessions earlier and propose approaches that had already been tried and failed. The executor would re-investigate solved problems because the resolution was buried under pages of newer output.

Junk drawer handoffs are what happens when you use a single file for everything. My initial handoff file tried to be a task queue, a decision log, an execution trace, and a session summary simultaneously. Every session appended to it. The instructions explicitly said to remove completed items — but once the file crossed a critical mass, the agents stopped following those instructions. Context rot ate its own mitigation: the rule to keep the file clean was buried in a file that had gotten too dirty to parse reliably. The executor agent couldn't reliably distinguish between a completed task and an active one because the formatting had drifted — the planner might use ### Pending Tasks one session and ## To-Do List the next, and the executor's pattern matching would silently miss the renamed section. Rather than failing on the missing section, the agent would invent a plausible reading of the file and carry on. The file grew until it consumed a meaningful portion of the context window just being read, crowding out space for actual reasoning.
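One mitigation for silent section drift is a parser that fails loudly when the schema it expects isn't there. Here's a minimal sketch; the section names and the `parse_handoff` helper are illustrative, not the exact format I used:

```python
import re

# Hypothetical schema: the sections every handoff file must contain.
REQUIRED_SECTIONS = ["Pending Tasks", "Decisions", "Constraints"]

def parse_handoff(text: str) -> dict[str, str]:
    """Split a handoff file into sections keyed by header, refusing to
    proceed if a required section is missing or has been renamed."""
    sections: dict[str, str] = {}
    current = None
    for line in text.splitlines():
        m = re.match(r"^#{2,3}\s+(.*)$", line)
        if m:
            current = m.group(1).strip()
            sections[current] = ""
        elif current is not None:
            sections[current] += line + "\n"
    missing = [s for s in REQUIRED_SECTIONS if s not in sections]
    if missing:
        # Fail loudly instead of letting the agent guess at intent.
        raise ValueError(f"Handoff file is missing sections: {missing}")
    return sections
```

The point is the failure mode, not the parsing: a renamed header becomes a hard error the human sees, rather than a silent miss the agent papers over.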

Cross-project contamination was the subtlest failure and the most damaging. I operate multiple projects out of a single knowledge base. When the executor searched for context on "state management," it pulled results from every project — business planning notes, product architecture docs, personal process logs — because nearly every project file mentions configuration or state in some form. The agent's context window would flood with irrelevant architectural decisions from unrelated initiatives when it only needed the routing logic for the specific codebase it was modifying.

All three failures share a root cause: I was asking my knowledge base to do two jobs it couldn't do simultaneously. A knowledge base is designed for human cognitive mapping — semantic search, dense linking, interconnected ideas. An agent execution layer needs the opposite: strict boundaries, deterministic state, predictable schemas.

The Research Detour

Before rebuilding, I commissioned a deep research pass across the multi-agent ecosystem. I wanted to know what the production-grade systems were doing, and whether I was reinventing something that already existed.

The landscape sorts into a spectrum. On one end, you have graph-based orchestration frameworks like LangGraph — full state machines with checkpoint persistence, typed reducers, and Redis-backed memory. They solve every problem I was hitting. They're also wildly overengineered for a two-agent setup run by one person. Operating LangGraph means deploying and maintaining a separate Python application just to manage the handoffs.

On the other end, you have the AGENTS.md convention — a standardized instruction file that lives in your repository root and gets loaded at the start of every agent session. It's universally adopted and effective for static context: coding style, test commands, environment constraints. But it has zero capacity for dynamic execution state. It's a README, not a database.

In the middle, I found the patterns that actually mattered for my scale:

The Session-End Spec Update pattern, documented by enterprise AI practitioners, treats session boundaries as a feature rather than a bug. Instead of trying to maintain continuity through one massive conversation, you bound each session and extract exactly four things at the end: what was completed, what decisions were made (with immutable IDs), what constraints were discovered (so the next session doesn't retry failed approaches), and what changed in the dependency landscape. This compact summary gives the next session a clean start with full institutional memory.
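The four-field extract is small enough to pin down as a schema. A sketch, with field and method names of my own invention:

```python
from dataclasses import dataclass

@dataclass
class SessionSummary:
    """The four things extracted at a session boundary (names illustrative)."""
    completed: list[str]           # work finished this session
    decisions: dict[str, str]      # immutable ID -> decision and rationale
    constraints: list[str]         # approaches tried and failed
    dependency_changes: list[str]  # what shifted in the dependency landscape

    def to_markdown(self) -> str:
        """Render the summary the next session loads as its starting context."""
        lines = ["## Session Summary"]
        lines += ["### Completed"] + [f"- {c}" for c in self.completed]
        lines += ["### Decisions"] + [f"- {k}: {v}" for k, v in self.decisions.items()]
        lines += ["### Constraints"] + [f"- {c}" for c in self.constraints]
        lines += ["### Dependency Changes"] + [f"- {d}" for d in self.dependency_changes]
        return "\n".join(lines)
```

Whether you render it as Markdown or JSON matters less than the discipline of emitting exactly these four fields and nothing else.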

The hybrid handoff format — Markdown for the human approval layer, structured data for the machine execution layer — resolves the tension between human readability and agent reliability. Pure JSON is hostile to human review. Pure Markdown is unparseable for agents at scale. The emerging pattern embeds both in a single artifact.
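One way to embed both layers in a single artifact is a fenced JSON block inside the Markdown: humans review the prose, agents parse the block. A sketch of the machine side, assuming that fence convention (it's a pattern, not a standard):

```python
import json
import re

def machine_payload(handoff_md: str) -> dict:
    """Extract the structured payload from a hybrid handoff artifact.
    The surrounding Markdown is for human review; agents read only the
    embedded JSON block (the fence convention here is an assumption)."""
    m = re.search(r"```json\n(.*?)\n```", handoff_md, re.DOTALL)
    if m is None:
        raise ValueError("No machine-readable block found in handoff artifact")
    return json.loads(m.group(1))
```

The human skims the prose and approves; the executor never touches the prose at all, so formatting drift in the readable layer can't corrupt the machine layer.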

Project isolation via repository-scoped boundaries was the clearest consensus finding. A single global handoff file is an anti-pattern. Every project needs its own isolated handoff boundary, stored directly within its repository structure, so the execution context travels with the code.

The Architecture That Emerged

What I built isn't a framework. It's a protocol — a set of conventions enforced through file structure rather than code. Three files, strict ownership rules, and a clear lifecycle.

The state transfer file carries exactly what the next session needs to resume: the current focus, active threads with status indicators, a decisions table with immutable IDs and rationale, and a constraints table documenting what's been tried and failed. It's bounded — completed items are removed, not checked off. Archived items go to a separate historical record. The file stays lean by design, never exceeding what fits comfortably in an agent's context window alongside actual work.

The work package file is a bidirectional mailbox. The planner writes execution instructions into one section. The executor reads them, does the work, and clears them. If the executor hits a blocker or has a question, it writes into a separate section going the other direction. Each section is cleared after the receiving agent processes it. The file resets to a sentinel state between uses — it never accumulates. This was the single biggest improvement. By separating "what to do next" from "what's the overall state of the project," both files stay parseable and neither becomes a junk drawer.
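The clear-after-processing lifecycle can be sketched as a read-then-reset helper. Section names and the "(empty)" sentinel are illustrative choices, not the exact convention I settled on:

```python
from pathlib import Path

def consume_section(mailbox: Path, section: str) -> str:
    """Read one direction of the work-package mailbox, then reset that
    section to a sentinel so the file never accumulates."""
    parts: dict[str, list[str]] = {}
    current = None
    for line in mailbox.read_text().splitlines():
        if line.startswith("## "):
            current = line[3:].strip()
            parts[current] = []
        elif current is not None:
            parts[current].append(line)
    payload = "\n".join(parts.get(section, [])).strip()
    parts[section] = ["(empty)"]  # cleared after the receiving agent processes it
    rebuilt = "\n".join(f"## {k}\n" + "\n".join(v) for k, v in parts.items()) + "\n"
    mailbox.write_text(rebuilt)
    return payload
```

The invariant is that reading is destructive: once a message is consumed, the file is back at its sentinel state, so there is never a third session's worth of instructions to misread.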

The decisions and constraints tables deserve their own mention because they're the mechanism that breaks the context rot cycle. When a session discovers that a specific approach fails — a library incompatibility, a platform limitation, a caching behavior that invalidates assumptions — it gets logged as a numbered constraint. The next planner session reads the constraints table before generating a new plan. This prevents the most expensive failure mode in multi-agent systems: the agent confidently proposing an approach that was already tried and failed, burning an entire session rediscovering the same dead end.
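The planner-side check can be as simple as screening a draft plan against the constraints table before anything executes. A crude keyword sketch, assuming a hypothetical convention where each constraint's text leads with the banned approach:

```python
def check_plan_against_constraints(plan_steps: list[str],
                                   constraints: dict[str, str]) -> list[str]:
    """Flag any plan step that mentions an approach already logged as a
    dead end. Keyword matching is deliberately crude; the real gate is
    forcing the planner to read the table at all."""
    violations = []
    for step in plan_steps:
        for cid, text in constraints.items():
            banned = text.split(":")[0].lower()  # assumed "approach: why it failed" format
            if banned in step.lower():
                violations.append(f"{cid}: step '{step}' conflicts with '{text}'")
    return violations
```

Even this naive version catches the expensive case: a confident plan that walks straight back into a documented dead end.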

Project isolation is enforced structurally. Each project has its own state files in its own directory. The planner operates at the workspace level and can see across projects. The executor is spawned inside a specific project directory and inherits only that project's context. It's structurally blind to everything else. This completely eliminates cross-project contamination.

What Still Broke

I'd be lying if I said the architecture solved everything cleanly. Three friction points persist, and they're instructive because they're not protocol failures — they're environmental realities that no file convention can fix.

Prose drift is ongoing. Agents don't respect Markdown structure reliably without enforcement. The planner occasionally renames table headers or changes the format of status indicators between sessions. Without programmatic schema validation, this requires human vigilance during review. It's manageable at my scale but would be a showstopper for a fully automated pipeline.

Stale reads turned out to be an infrastructure problem masquerading as a protocol problem. My planning agent runs in a cloud-hosted sandbox that mounts my local filesystem through a virtualized bridge. The bridge caches aggressively and doesn't invalidate when the local filesystem changes. The result: the planner confidently reads a handoff file that the executor updated twenty minutes ago, sees outdated state, and generates a plan based on information that's already wrong. The mitigation is procedural — start a fresh session after every executor run to get a clean mount — but it took five debugging sessions to identify the root cause because the agent never signals that it's reading stale data. It just proceeds with full confidence.

Write truncation was the scariest failure. On three separate occasions, the executor's session ended with the handoff file truncated mid-sentence — the write process failed partway through, leaving corrupt state. Because the file looked mostly correct on casual inspection, the planner would load it, parse the truncated section as intentional, and propagate the corruption forward. The fix was a post-write verification step — checksum the file, verify line count, warn if the file shrank — but the root cause remains unresolved.
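The verification step described above can be sketched in a few lines. The 50% shrink threshold is my own heuristic, not a principled constant:

```python
import hashlib
from pathlib import Path

def verified_write(path: Path, content: str, shrink_ratio: float = 0.5) -> None:
    """Write a handoff file, then read it back and verify it landed intact.
    A state file that halves in size is more likely truncated than
    legitimately pruned, so a large shrink triggers a warning."""
    old_size = path.stat().st_size if path.exists() else 0
    expected = hashlib.sha256(content.encode()).hexdigest()
    path.write_text(content)
    readback = path.read_text()
    if hashlib.sha256(readback.encode()).hexdigest() != expected:
        raise IOError(f"Post-write checksum mismatch for {path}")
    if readback.count("\n") != content.count("\n"):
        raise IOError(f"Line count changed after writing {path}")
    new_size = len(readback.encode())
    if old_size and new_size < old_size * shrink_ratio:
        print(f"WARNING: {path} shrank from {old_size} to {new_size} bytes")
```

This catches the corruption at the write boundary, before the next planner session can parse a truncated section as intentional and propagate it forward.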

The Takeaway

If you're building multi-agent systems, the instinct is to focus on the agents: which model, what temperature, how many tools, what system prompt. That's the wrong place to look. The agents are the easy part. They're stateless by design — every session starts from zero.

The hard engineering is in the protocol layer between them. How state transfers across session boundaries. How you prevent completed work from polluting active context. How you stop one project's decisions from contaminating another. How you detect when an agent is operating on stale assumptions with full confidence.

Your handoff architecture isn't a detail of your multi-agent system. It is your multi-agent system. Everything else is an API call.

Tim Downs Mullen is a systems engineering leader with 25 years in aerospace, defense, and healthcare technology. He builds AI-augmented development workflows for regulated environments where "move fast and break things" isn't an option.