Anthropic: Multi-Agent Harness Design for Long-Running Applications
Original: anthropic.com/engineering/harness-design-long-running-apps · Prithvi Rajasekaran, Anthropic Labs · March 2026 Category: Official / Foundational
Overview
Anthropic's engineering team tackles a specific problem: how do you keep an AI agent reliable and on-track during multi-hour autonomous coding sessions? Their answer is a multi-agent architecture inspired by GANs (Generative Adversarial Networks) — split the agent into a Generator that builds and an Evaluator that judges.
This article is the foundational text for Time Scalability in harness engineering: making a single agent work reliably over long periods.
The Two Failure Modes
Anthropic identifies two problems that environment design alone cannot prevent:
1. Direction Drift (Context Anxiety)
As the context window fills up, the model loses coherence:
- Forgets early constraints
- Drifts from the original direction
- Dives deeper into irrelevant details
- Context anxiety: starts wrapping up work prematurely when approaching what it believes is the context limit
Solution: Context Reset, not Compaction
Compaction (summarizing old conversation) preserves continuity but doesn't give the agent a clean slate. Context anxiety persists.
Context reset clears the entire context window and starts a fresh agent with a structured handoff artifact — enough state for the new agent to continue cleanly. More expensive in orchestration, but fundamentally more reliable.
2. Self-Evaluation Blindness
When asked to evaluate their own work, agents consistently praise it — even when the quality is mediocre. This is especially severe for subjective tasks (design, UX), but even occurs in tasks with verifiable outcomes.
Solution: Separate Generator and Evaluator
Splitting "doing" and "judging" into two independent agents is the key lever. The evaluator doesn't share internal state with the generator, which is what makes honest judgment possible. Then tuning the evaluator to be skeptical by default becomes tractable.
The Three-Agent Architecture
User Requirement (one sentence)
│
▼
┌─────────┐
│ Planner │ Expands requirement → full product spec
│ │ Product-level design only, no implementation details
└────┬────┘
│ spec
▼
┌─────────┐
│Generator │ Implements features sprint-by-sprint
│ │ Context reset between sprints
└────┬────┘
│ output
▼
┌─────────┐
│Evaluator │ Uses Playwright to interact with the running app
│ │ Grades against sprint contract
│ │ Strict by default
└────┬────┘
│ pass/fail
▼
Pass → Next sprint
Fail → Generator iterates
Key Design Choices
- Planner deliberately avoids implementation details — specifying technical details too early causes error cascading
- Evaluator uses Playwright to test the real running app — not just reading code, but clicking buttons and checking UI
- 5–15 iteration rounds per evaluator cycle
- Sprint contract defines clear acceptance criteria per cycle
The Experiment: Digital Audio Workstation
The architecture produced a complete digital audio workstation:
- Runtime: ~4 hours
- Cost: $124
- Generator first run: 2 hours 7 minutes continuous
- Baseline comparison: Same prompt, single agent, 20 minutes, $9 — core features non-functional
The 13x cost increase bought a qualitative leap from "doesn't work" to "complete and functional."
The Most Important Insight: Harness Components Have Expiration Dates
"Every harness component encodes a hypothesis about what the current model can't do."
Each component has a different expiration speed:
| Component | Hypothesis | Status (Opus 4.6) |
|---|---|---|
| Context reset | Model can't maintain coherence in long context | ❌ Retired |
| Sprint decomposition | Model can't maintain direction in long sessions | ❌ Retired |
| Evaluator | Model overrates its own work | ✅ Still valuable |
When a new model releases, the correct action is to remove old components and test whether quality actually drops — not to keep stacking new components on top.
Why This Matters
This article introduces three ideas that are becoming foundational:
- Time Scalability as a distinct dimension of harness engineering
- Adversarial evaluation as a general-purpose quality mechanism
- Harness component lifecycle management — the discipline of removing components when models outgrow them
The GAN-inspired approach (generator + evaluator) has since been adopted by multiple open-source harness implementations.
See also: OpenAI: Harness Engineering · Wayne Zhang: Three Scaling Dimensions