Harness Engineering and Codex in Production

The core idea#

This archive card captures an engineering-practice article I think is genuinely important:

OpenAI’s official article: https://openai.com/zh-Hans-CN/index/harness-engineering/
Gemini-assisted close-reading summary: https://gemini.google.com/share/e447cc0560fa

The most valuable part isn’t the fact that “Codex can write code.” It’s this:

When a team decides not to hand-write code and instead lets an agent generate nearly all of the implementation, how does the focus of software engineering shift?

The article’s answer is crystal clear:

Humans are no longer primarily responsible for writing code
Humans are responsible for designing the environment, the boundaries, the feedback loops, and the knowledge systems
Agents are responsible for execution
The discipline of software engineering moves from “how to write code” to “how to design systems that let agents work reliably”

That’s the one point I think is most worth committing to memory.

Key takeaways#

1. The slogan isn’t AI replacing humans — it’s “Humans steer, agents execute”#

The single most important line in this article can be boiled down to:

Humans steer, agents execute.

What the OpenAI team ran was an extreme experiment:

Start from an empty repository
Have nearly all the code generated by Codex
Including:
- Application logic
- Tests
- CI configuration
- Documentation
- Observability definitions
- Internal tooling
Humans don’t write code directly; they drive output through prompts, review, design constraints, and system building

The article mentions they started with three engineers, later scaled to seven, and over roughly five months shipped a real product approaching 1 million lines of code, opening and merging around 1,500 PRs.

So this article is no longer about “can AI write a demo?” It’s about:

A production-grade project
Continuous deployment
Incidents and fixes
Real users
An agent-first way of organizing software engineering

2. The real bottleneck is no longer writing code — it’s environment design#

One of the article’s strongest observations:

Early progress was slow, not because Codex couldn’t write code, but because:

The environment wasn’t well-defined enough
The tooling wasn’t complete enough
The structure wasn’t clear enough
The feedback loops weren’t direct enough

In other words, the problem wasn’t that the model couldn’t do it — it was that the system hadn’t been built to let the model reliably do it.

So the engineer’s new responsibilities become:

Make goals decomposable
Make context readable
Make rules enforceable
Make results verifiable

Once a project gets stuck, the way to think about it is no longer “let me manually fill in this feature,” but rather:

What capability is the agent actually missing right now? And can that capability be added to the system explicitly?

3. The repo has to become a knowledge base the agent can actually read#

This is the part of the article I agree with most.

For an agent, knowledge that lives outside the repo essentially doesn’t exist:

Slack discussions
Tacit experience inside someone’s head
Agreements buried in Google Docs

If it hasn’t made it into the repo, hasn’t been versioned, and hasn’t been organized in a structured way, it won’t reliably enter the agent’s operating context.

So in the end, their approach wasn’t to maintain one giant AGENTS.md, but rather:

Use a short AGENTS.md as a map
Put the real knowledge in a structured docs/
Make the documentation the system of record
Continuously clean and update docs via lint / CI / a doc-gardening agent

The philosophy behind this is powerful:

Give the agent a map, not a 1,000-page manual.

I think this is an especially important lesson for any agent-first repo.
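
To make the "short map plus structured docs/" idea concrete for myself, here is a minimal sketch of a lint that could run in CI. It is my own illustration, not something from the article: the function name `lint_agent_map`, the 150-line budget, and the Markdown-link convention are all assumptions.

```python
import re
from pathlib import Path

MAX_MAP_LINES = 150  # hypothetical budget, not a figure from the article
# Captures the path part of a Markdown link, dropping any "#anchor" suffix.
LINK_RE = re.compile(r"\[[^\]]*\]\(([^)#\s]+)[^)]*\)")


def lint_agent_map(repo_root: Path, max_lines: int = MAX_MAP_LINES) -> list[str]:
    """Return the problems found in AGENTS.md; an empty list means clean."""
    map_file = repo_root / "AGENTS.md"
    if not map_file.is_file():
        return ["AGENTS.md is missing"]

    problems = []
    text = map_file.read_text(encoding="utf-8")
    line_count = len(text.splitlines())
    if line_count > max_lines:
        problems.append(f"AGENTS.md has {line_count} lines (limit {max_lines})")

    for target in LINK_RE.findall(text):
        if "://" in target:
            continue  # external URLs are out of scope for this check
        if not (repo_root / target).exists():
            problems.append(f"broken link: {target}")
    return problems
```

Failing the build whenever this returns a non-empty list keeps the map short and keeps every pointer in it resolvable, which is the property the agent depends on.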

4. Design the system for “agent readability”#

Traditional software engineering talks about human readability. The article emphasizes another layer:

agent readability

That is:

Can the repo let the agent quickly locate knowledge?
Can the UI be read and understood by the agent?
Can logs / metrics / traces feed directly into the reasoning loop?
Can the agent reproduce a bug, verify the fix, and collect feedback on its own?

The OpenAI team did a few crucial things:

a. Let Codex read the UI directly#

Hook into the Chrome DevTools Protocol
See DOM snapshots
See screenshots
Navigate and verify the UI
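
A rough sketch of what "reading the UI" could look like at the protocol level. `Page.navigate`, `DOMSnapshot.captureSnapshot`, and `Page.captureScreenshot` are real Chrome DevTools Protocol methods, but the helper names below are mine, the WebSocket transport is left out entirely, and the notes above don't say how the OpenAI team actually wired this up.

```python
import itertools
import json

_ids = itertools.count(1)  # CDP commands carry a client-chosen integer id


def cdp_command(method: str, **params) -> str:
    """Serialize one CDP command as the JSON text sent over the debugging socket."""
    return json.dumps({"id": next(_ids), "method": method, "params": params})


def ui_inspection_plan(url: str) -> list[str]:
    """Hypothetical sequence: open a page, then capture its DOM and a screenshot."""
    return [
        cdp_command("Page.navigate", url=url),
        # computedStyles is a required parameter; an empty list requests none.
        cdp_command("DOMSnapshot.captureSnapshot", computedStyles=[]),
        cdp_command("Page.captureScreenshot", format="png"),
    ]
```

In a real setup each message would be sent to the browser's debugging endpoint and the response matched back by `id`; the sketch only shows how little is needed to give an agent a DOM snapshot and a screenshot to reason about.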

b. Let Codex read observability directly#

Logs are queryable
Metrics are queryable
Traces are queryable
Each worktree maps to a temporary, isolated observability environment

c. Let Codex self-serve debugging inside a worktree#

Every change can spin up an isolated instance
Reproduce the bug
Verify the fix
Record before/after comparison videos
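
One way I could imagine giving every worktree its own isolated instance and observability scope is to derive both from the worktree path. This is purely a sketch under my own assumptions: `WorktreeEnv`, `env_for_worktree`, and the port range are hypothetical, and the article doesn't describe the mechanism it uses.

```python
import hashlib
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class WorktreeEnv:
    namespace: str  # label to attach to this worktree's logs, metrics, and traces
    app_port: int   # port for this worktree's isolated app instance


def env_for_worktree(worktree: Path, base_port: int = 20000, span: int = 10000) -> WorktreeEnv:
    """Derive a stable namespace and port from the worktree path (hypothetical scheme)."""
    digest = hashlib.sha256(worktree.as_posix().encode("utf-8")).hexdigest()
    namespace = f"{worktree.name}-{digest[:8]}"
    # Same path always yields the same port; collisions between different
    # worktrees are possible, so a real setup would also probe availability.
    port = base_port + int(digest[:8], 16) % span
    return WorktreeEnv(namespace=namespace, app_port=port)
```

Because the mapping is deterministic, the agent can recompute where "its" instance and telemetry live from nothing but the directory it is working in, without a shared registry.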

I think this point matters a lot, because it shows:

An agent’s true capability ceiling is often determined by the boundaries of what the system makes observable, verifiable, and controllable.

5. Architectural boundaries must be mechanized, not remembered by humans#

The article stresses this repeatedly:

You can’t rely on “team conventions” alone
You have to write the boundaries into linters, structural tests, and CI rules

They adopted a very strict layered architecture:

Types -> Config -> Repo -> Service -> Runtime -> UI

Plus a unified Providers entry point to handle cross-cutting concerns, such as:

auth
telemetry
connectors
feature flags

The key point:

Let the agent be free in local implementation details
But have zero tolerance on boundaries and dependency directions

Put another way:

Be extremely strict about system structure, and reasonably permissive about the specific way things are expressed.

I think this is exactly the right engineering discipline for the agent-first era.
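
Here is a small sketch of how the dependency-direction rule could be turned into a structural test. I am assuming the arrows mean that a layer may import only from itself or from layers to its left in the chain; the module naming scheme and function names are hypothetical, and the Providers entry point is not modeled.

```python
from typing import Optional

# Layer order taken from the chain above, lowest layer first.
LAYERS = ["types", "config", "repo", "service", "runtime", "ui"]
RANK = {name: index for index, name in enumerate(LAYERS)}


def layer_of(module: str) -> Optional[str]:
    """Assume the layer appears as a dotted-path segment, e.g. 'app.service.orders'."""
    for part in module.split("."):
        if part in RANK:
            return part
    return None


def find_violations(imports: list[tuple[str, str]]) -> list[str]:
    """Flag every (importer, imported) edge that reaches into a later layer."""
    violations = []
    for importer, imported in imports:
        src, dst = layer_of(importer), layer_of(imported)
        if src is None or dst is None:
            continue  # modules outside the layering are ignored in this sketch
        if RANK[dst] > RANK[src]:
            violations.append(f"{importer} ({src}) must not import {imported} ({dst})")
    return violations
```

Fed with the import edges extracted from the codebase, a CI job can fail on any non-empty result, so the boundary holds no matter who or what wrote the change.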

6. Lint / rules / “golden principles” are worth even more in the agent era#

When humans drive development, a lot of conventions can feel tedious. But when agents drive development, those same conventions become multipliers.

Because:

Once a rule is explicit
The agent can follow it continuously and at scale
And enforce it consistently across all PRs

The article mentions they mechanize a lot of “taste judgments,” such as:

Naming conventions
File-size limits
Structured-logging rules
Boundary-validation principles
Avoiding YOLO-style data probing

They sum these up as a more subjective but still enforceable class of “golden principles.”

I strongly agree with this, because what it really amounts to is:

Distilling human taste into system constraints that a machine can enforce continuously.
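
As an illustration of what "taste as a machine-enforceable constraint" might look like, here is a minimal checker covering three of the categories listed above: naming, file size, and structured logging. The specific rules (snake_case file names, a 400-line limit, no bare `print()`) are my own placeholders, not the team's actual principles.

```python
import re

SNAKE_CASE_FILE = re.compile(r"^[a-z][a-z0-9_]*\.py$")
PRINT_CALL = re.compile(r"^\s*print\(")
MAX_LINES = 400  # hypothetical limit


def check_file(name: str, source: str, max_lines: int = MAX_LINES) -> list[str]:
    """Return human-readable findings for one source file (empty list means clean)."""
    findings = []
    if not SNAKE_CASE_FILE.match(name):
        findings.append(f"{name}: file name is not snake_case")

    lines = source.splitlines()
    if len(lines) > max_lines:
        findings.append(f"{name}: {len(lines)} lines exceeds limit of {max_lines}")

    for lineno, line in enumerate(lines, start=1):
        if PRINT_CALL.match(line):
            findings.append(f"{name}:{lineno}: use the structured logger instead of print()")
    return findings
```

The findings are written as instructions rather than bare error codes on purpose: the same message that fails CI can be handed straight back to the agent as the next thing to fix.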

7. In a high-throughput agent environment, “waiting” is more expensive than “making mistakes”#

The article also has a counterintuitive but very real point:

When agent throughput is extremely high:

The cost of fixing a mistake after the fact goes down
The cost of waiting on a human blocker goes up

So they removed a lot of the traditional human-imposed blocking merge gates. Short-lived PRs, fast patches, and continuous refactoring may be more sensible than obsessing over zero defects in a single merge.

This isn’t to say quality doesn’t matter — it’s to say:

In an agent-heavy workflow, quality control should rely more on automated feedback loops, rather than human gatekeeping that throttles throughput.

8. “AI sludge” is a real problem and demands continuous garbage collection#

I find this framing especially vivid:

Letting agents write all the code constantly produces pattern drift and “AI sludge”
If you don’t deal with it, the codebase slowly accumulates entropy

At first they spent every Friday, roughly 20% of the week, cleaning up by hand, which obviously doesn’t scale. So they systematized this too:

Run background Codex jobs on a schedule
Scan for deviations
Update quality grades
Automatically open small refactoring PRs

It runs continuously, like garbage collection.
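
The scan, grade, and small-PR steps above could be sketched roughly like this. The `Deviation` type, the A/B/C scale, and the three-fixes-per-PR cap are assumptions I made up for illustration; the scanning itself and the actual PR creation are out of scope.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Deviation:
    module: str  # where the drift was found
    rule: str    # which principle it violates


def grade(count: int) -> str:
    """Hypothetical quality scale based on the number of open deviations."""
    if count == 0:
        return "A"
    if count <= 3:
        return "B"
    return "C"


def plan_cleanup(modules, deviations, max_per_pr=3):
    """Return (grades per module, batches of deviations sized for small PRs)."""
    by_module = defaultdict(list)
    for deviation in deviations:
        by_module[deviation.module].append(deviation)

    grades = {m: grade(len(by_module.get(m, []))) for m in modules}

    batches = []
    for module in sorted(by_module):
        found = by_module[module]
        # Never mix modules in one batch, and cap each batch so PRs stay reviewable.
        for start in range(0, len(found), max_per_pr):
            batches.append(found[start:start + max_per_pr])
    return grades, batches
```

Run on a schedule, each batch would become one small refactoring PR, which is what keeps the cleanup closer to incremental garbage collection than to a periodic big rewrite.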

This line of thinking is worth remembering:

Technical debt isn’t something you pay off all at once at the end — you pay it down continuously in small amounts, like servicing a high-interest loan.

Current understanding / conclusions#

My core read is that the article is really about:

What an agent-first repo should look like
Where the future engineer’s leverage lies
Where software-engineering discipline should be applied

The most important shift: from code craftsmanship to systems craftsmanship#

The old emphasis was:

How to hand-write better code

What matters more now is:

How to design a better environment
How to make context readable to the agent
How to automate the feedback loop
How to turn boundaries and taste into an enforceable system

The three keywords that resonated most with me#

Map, not manual
- Give the agent a map, not a verbose manual
Agent readability
- Every part of the system should be designed so the agent can reason about it and verify it directly
Harness engineering
- The engineering focus moves from writing the implementation to building the harness that lets agents work efficiently

Implications for my actual work#

If I translate this article into more concrete advice for my own work, the most useful points are:

1. The repo should become a real knowledge system#

Key decisions and agreements must go into the repo
Docs must be structured
AGENTS.md should be short, stable, and map-like

2. Let the agent see the UI and observability directly#

Screenshots
DOM
Logs
Metrics
Traces

These aren’t “nice-to-have extras” — they’re core infrastructure for an agent workflow.

3. Write boundaries and taste as code#

lint
CI
structural tests
golden principles

Whatever can be automated shouldn’t rely on human memory.

4. Keep doing doc gardening and AI-sludge cleanup#

Docs rot
Patterns drift
You have to clean continuously

5. The scarcest human resource becomes attention#

There’s an underlying logic in the article that rings very true:

The scarcest thing in the future isn’t code output — it’s human time, attention, and judgment.

To be added#

Directions worth fleshing out later:

How to port this workflow to smaller teams / personal projects
How execution plans should actually be written in an agent workflow
Automation design for doc gardening / repo linting / quality-drift management
How to redesign the quality system once PR lifecycles get shorter
What this pattern means for frontend, full-stack, and infrastructure projects
