OpenAI: Harness Engineering — Leveraging Codex in an Agent-First World
Original: openai.com/index/harness-engineering · OpenAI · March 2026 · Category: Official / Foundational
Overview
This is the blog post that put "Harness Engineering" into mainstream discourse. OpenAI describes how a team of three engineers built and shipped a software product without writing a single line of code by hand: Codex generated over 1 million lines across ~1,500 merged PRs in five months.
The core thesis: humans steer, agents execute. When an engineer's job is no longer to write code, but to design environments, specify intent, and build feedback loops for agents, the discipline is called Harness Engineering.
Key Takeaways
1. Humans Design Environments, Not Code
The primary job of the engineering team became enabling agents to do useful work. When something failed, the fix was never "try harder" — it was "what capability is missing, and how do we make it legible and enforceable for the agent?"
2. Give Agents a Map, Not a Manual
The team tried the "one big AGENTS.md" approach. It failed:
- A giant instruction file crowds out the actual task and code
- When everything is "important," nothing is — agents pattern-match locally instead of navigating intentionally
- Monolithic manuals rot instantly and are impossible to verify
Solution: Treat AGENTS.md as a table of contents (~100 lines), with pointers to deeper sources of truth in a structured docs/ directory.
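A table-of-contents AGENTS.md along these lines illustrates the idea (the file names and paths here are made up, not OpenAI's actual layout):

```markdown
# AGENTS.md — start here, then follow the pointers

- Architecture overview: docs/architecture/overview.md
- Design docs (one per feature): docs/design/
- Coding conventions: docs/conventions.md
- How to run tests and linters: docs/verification.md

Read only the documents relevant to your task; do not load everything.
```

Because each pointer targets a single source of truth, the file stays short, stays verifiable, and leaves the context window for the task itself.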
3. Constraints Beat Instructions
OpenAI uses custom linters to enforce architectural rules. Lint error messages serve double duty as repair instructions for the agent. Key insight:
"Constraints are executable and deterministic. Instructions are interpretable and ambiguous. In the agent workflow, this distinction matters more than in human teams."
4. Perfectionism Kills Throughput
The team adopted minimal-blocking merges — waiting is more expensive than fixing. Agent-to-agent review loops handle quality iteration:
Engineer writes task prompt
→ Codex executes (often 6+ hours, while humans sleep)
→ Codex self-reviews locally
→ Requests additional agent reviews
→ Iterates in loop until all reviewers satisfied
→ PR opened (human review optional)
Average throughput: 3.5 PRs per engineer per day, increasing as the team grew from 3 to 7 engineers.
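The control flow of that loop can be sketched as follows. `execute` stands in for a Codex run and each reviewer returns a list of feedback notes (empty means approved); both are stubs, not the real harness API.

```python
from typing import Callable, List, Tuple

def review_until_approved(
    task_prompt: str,
    execute: Callable[[str], str],
    reviewers: List[Callable[[str], List[str]]],
    max_rounds: int = 10,
) -> Tuple[str, bool]:
    """Execute the task, then loop review -> revise until all reviewers approve."""
    artifact = execute(task_prompt)
    for _ in range(max_rounds):
        feedback = [note for review in reviewers for note in review(artifact)]
        if not feedback:  # every reviewer satisfied: ready to open the PR
            return artifact, True
        # Re-run the agent with the accumulated feedback appended to the prompt.
        artifact = execute(
            task_prompt + "\nAddress review feedback:\n" + "\n".join(feedback)
        )
    return artifact, False  # loop stalled: escalate to a human
```

A human steps in only when the loop stalls (or by choice), matching the "human review optional" step above.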
5. Context Management Is the Core Challenge
Knowledge must be:
- Version-controlled — in the repo, not in Slack or Google Docs
- Discoverable — agent can find it when needed
- Structured — design docs, architecture docs, verification status all catalogued
"What Codex can't see doesn't exist."
6. Make Everything Legible to Agents
The team wired the Chrome DevTools Protocol into the agent runtime and created skills for DOM snapshots, screenshots, and navigation. They also exposed logs (LogQL) and metrics (PromQL) to Codex. This enabled prompts like:
"Ensure service startup completes in under 800ms" "No span in these four critical user journeys exceeds two seconds"
Why This Matters for Harness Engineering
This article established the Interaction Scalability dimension: how do you let humans steer large numbers of agents with minimal intervention? OpenAI's answer evolved from "write prompts, trigger Codex" into Symphony — a persistent daemon that turns Linear tickets into automated agent runs with Proof of Work.
The broader implication: harness engineering is a meta-discipline. Improvements to the harness (better docs, better tests, better constraints) compound across all future agent runs. Writing code doesn't compound. Improving the environment does.
See also: Anthropic: Multi-Agent Harness Design · Wayne Zhang: Three Scaling Dimensions