OpenAI: Harness Engineering — Leveraging Codex in an Agent-First World
Original: openai.com/index/harness-engineering · OpenAI · March 2026 · Category: Official / Foundational
Overview
This is the blog post that put "Harness Engineering" into mainstream discourse. OpenAI describes how a team of three engineers built and shipped a software product without writing a single line of code by hand: Codex generated over 1 million lines across ~1,500 merged PRs in five months.
The core thesis: humans steer, agents execute. When an engineer's job is no longer to write code, but to design environments, specify intent, and build feedback loops for agents, the discipline is called Harness Engineering.
Key Takeaways
1. Humans Design Environments, Not Code
The primary job of the engineering team became enabling agents to do useful work. When something failed, the fix was never "try harder" — it was "what capability is missing, and how do we make it legible and enforceable for the agent?"
2. Give Agents a Map, Not a Manual
The team tried the "one big AGENTS.md" approach. It failed:
- A giant instruction file crowds out the actual task and code
- When everything is "important," nothing is — agents pattern-match locally instead of navigating intentionally
- Monolithic manuals rot instantly and are impossible to verify
Solution: Treat AGENTS.md as a table of contents (~100 lines), with pointers to deeper sources of truth in a structured docs/ directory.
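A table-of-contents AGENTS.md along these lines illustrates the idea (the file names and paths here are made up, not OpenAI's actual layout):

```markdown
# AGENTS.md — start here, then follow the pointers

- Architecture overview: docs/architecture/overview.md
- Design docs (one per feature): docs/design/
- Coding conventions: docs/conventions.md
- How to run tests and linters: docs/verification.md

Read only the documents relevant to your task; do not load everything.
```

Because each pointer targets a single source of truth, the file stays short, stays verifiable, and leaves the context window for the task itself.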
3. Constraints Beat Instructions
OpenAI uses custom linters to enforce architectural rules. Lint error messages serve double duty as repair instructions for the agent. Key insight:
"Constraints are executable and deterministic. Instructions are interpretable and ambiguous. In the agent workflow, this distinction matters more than in human teams."
4. Perfectionism Kills Throughput
The team adopted minimal-blocking merges — waiting is more expensive than fixing. Agent-to-agent review loops handle quality iteration:
Engineer writes task prompt
→ Codex executes (often 6+ hours, while humans sleep)
→ Codex self-reviews locally
→ Requests additional agent reviews
→ Iterates in loop until all reviewers satisfied
→ PR opened (human review optional)
Average throughput: 3.5 PRs per engineer per day, increasing as the team grew from 3 to 7 engineers.
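The control flow of that loop can be sketched as follows. `execute` stands in for a Codex run and each reviewer returns a list of feedback notes (empty means approved); both are stubs, not the real harness API.

```python
from typing import Callable, List, Tuple

def review_until_approved(
    task_prompt: str,
    execute: Callable[[str], str],
    reviewers: List[Callable[[str], List[str]]],
    max_rounds: int = 10,
) -> Tuple[str, bool]:
    """Execute the task, then loop review -> revise until all reviewers approve."""
    artifact = execute(task_prompt)
    for _ in range(max_rounds):
        feedback = [note for review in reviewers for note in review(artifact)]
        if not feedback:  # every reviewer satisfied: ready to open the PR
            return artifact, True
        # Re-run the agent with the accumulated feedback appended to the prompt.
        artifact = execute(
            task_prompt + "\nAddress review feedback:\n" + "\n".join(feedback)
        )
    return artifact, False  # loop stalled: escalate to a human
```

A human steps in only when the loop stalls (or by choice), matching the "human review optional" step above.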
5. Context Management Is the Core Challenge
Knowledge must be:
- Version-controlled — in the repo, not in Slack or Google Docs
- Discoverable — agent can find it when needed
- Structured — design docs, architecture docs, verification status all catalogued
"What Codex can't see doesn't exist."
6. Make Everything Legible to Agents
The team wired the Chrome DevTools Protocol into the agent runtime and created skills for DOM snapshots, screenshots, and navigation. They also exposed logs (LogQL) and metrics (PromQL) to Codex. This enabled prompts like:
"Ensure service startup completes in under 800ms" "No span in these four critical user journeys exceeds two seconds"
Why This Matters for Harness Engineering
This article established the Interaction Scalability dimension: how do you let humans steer large numbers of agents with minimal intervention? OpenAI's answer evolved from "write prompts, trigger Codex" into Symphony — a persistent daemon that turns Linear tickets into automated agent runs with Proof of Work.
The broader implication: harness engineering is a meta-discipline. Improvements to the harness (better docs, better tests, better constraints) compound across all future agent runs. Writing code doesn't compound. Improving the environment does.
See also: Anthropic: Multi-Agent Harness Design · Wayne Zhang: Three Scaling Dimensions