Harness Engineering and Codex in Production
A practical write-up on how the OpenAI team built a production-grade project with code written almost 100% by Codex.
The core idea#
This archive card captures an engineering-practice article I think is genuinely important:
- OpenAI’s official article: https://openai.com/zh-Hans-CN/index/harness-engineering/
- Gemini-assisted close-reading summary: https://gemini.google.com/share/e447cc0560fa
The most valuable part isn’t the fact that “Codex can write code.” It’s this:
When a team decides not to hand-write code and instead lets an agent generate nearly all of the implementation, how does the focus of software engineering shift?
The article’s answer is crystal clear:
- Humans are no longer primarily responsible for writing code
- Humans are responsible for designing the environment, the boundaries, the feedback loops, and the knowledge systems
- Agents are responsible for execution
- The discipline of software engineering moves from “how to write code” to “how to design systems that let agents work reliably”
That’s the one point I think is most worth committing to memory.
Key takeaways#
1. The slogan isn’t AI replacing humans — it’s “Humans steer, agents execute”#
The single most important line in this article can be boiled down to:
Humans steer, agents execute.
What the OpenAI team ran was an extreme experiment:
- Start from an empty repository
- Have nearly all the code generated by Codex
- Including:
  - Application logic
  - Tests
  - CI configuration
  - Documentation
  - Observability definitions
  - Internal tooling
- Humans don’t write code directly; they drive output through prompts, review, design constraints, and system building
The article mentions they started with three engineers, later scaled to seven, and over roughly five months shipped a real product approaching 1 million lines of code, opening and merging around 1,500 PRs.
So this article is no longer about “can AI write a demo?” It’s about:
- A production-grade project
- Continuous deployment
- Incidents and fixes
- Real users
- An agent-first way of organizing software engineering
2. The real bottleneck is no longer writing code — it’s environment design#
One of the article’s strongest observations:
Early progress was slow, not because Codex couldn’t write code, but because:
- The environment wasn’t well-defined enough
- The tooling wasn’t complete enough
- The structure wasn’t clear enough
- The feedback loops weren’t direct enough
In other words, the problem wasn’t that the model couldn’t do it — it was that the system hadn’t been built to let the model reliably do it.
So the engineer’s new responsibilities become:
- Make goals decomposable
- Make context readable
- Make rules enforceable
- Make results verifiable
Once a project gets stuck, the way to think about it is no longer “let me manually fill in this feature,” but rather:
What capability is the agent actually missing right now? And can that capability be added to the system explicitly?
3. The repo has to become a knowledge base the agent can actually read#
This is the part of the article I agree with most.
For an agent, knowledge that lives outside the repo essentially doesn’t exist:
- Slack discussions
- Tacit experience inside someone’s head
- Verbal consensus buried in Google Docs
If it hasn’t made it into the repo, hasn’t been versioned, and hasn’t been organized in a structured way, it won’t reliably enter the agent’s operating context.
So in the end, their approach wasn’t to maintain one giant AGENTS.md, but rather:
- Use a short AGENTS.md as a map
- Put the real knowledge in a structured docs/
- Make documentation the system of record
- Continuously clean and update docs via lint / CI / a doc-gardening agent
The philosophy behind this is powerful:
Give the agent a map, not a 1,000-page manual.
I think this is an especially important lesson for any agent-first repo.
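To make the "map, not manual" idea concrete for myself, here is a minimal sketch of a check that could run in CI. It is my own illustration, not something from the article: the line limit (`MAX_AGENTS_LINES`) and the function name `check_agents_map` are hypothetical, and I assume AGENTS.md points into docs/ with ordinary Markdown links.

```python
import re
from pathlib import Path

# Hypothetical limit: the article only says AGENTS.md is kept short.
MAX_AGENTS_LINES = 150

# Captures the path part of a Markdown link, ignoring any "#anchor" suffix.
LINK_RE = re.compile(r"\[[^\]]*\]\(([^)#\s]+)[^)]*\)")


def check_agents_map(repo_root: Path, max_lines: int = MAX_AGENTS_LINES) -> list:
    """Return a list of problems found in AGENTS.md (an empty list means OK)."""
    agents = repo_root / "AGENTS.md"
    if not agents.is_file():
        return ["AGENTS.md is missing"]

    text = agents.read_text(encoding="utf-8")
    problems = []

    # A map should stay short; long content belongs in docs/.
    line_count = len(text.splitlines())
    if line_count > max_lines:
        problems.append(f"AGENTS.md has {line_count} lines (limit {max_lines})")

    # Every relative link in the map must point at something that exists.
    for target in LINK_RE.findall(text):
        if "://" in target:
            continue  # external links are not checked in this sketch
        if not (repo_root / target).exists():
            problems.append(f"broken link: {target}")
    return problems
```

A CI job could call `check_agents_map(Path("."))` and fail the build when the list is non-empty, so the map cannot silently rot or grow back into a manual.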
4. Design the system for “agent readability”#
Traditional software engineering talks about human readability. The article emphasizes another layer:
agent readability
That is:
- Can the repo let the agent quickly locate knowledge?
- Can the UI be read and understood by the agent?
- Can logs / metrics / traces feed directly into the reasoning loop?
- Can the agent reproduce a bug, verify the fix, and collect feedback on its own?
The OpenAI team did a few crucial things:
a. Let Codex read the UI directly#
- Hook into the Chrome DevTools Protocol
- See DOM snapshots
- See screenshots
- Navigate and verify the UI
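For reference, the Chrome DevTools Protocol is a JSON message protocol: each command carries an `id`, a `method`, and `params`. The sketch below only builds the messages for the three steps listed above. The article does not say which CDP methods the team used, so the choice of `Page.navigate`, `DOMSnapshot.captureSnapshot`, and `Page.captureScreenshot` is my assumption, and the helper names are hypothetical. Actually sending the messages needs a WebSocket connection to the browser's debugging endpoint, which is left out here.

```python
import itertools
import json


class CdpCommandBuilder:
    """Builds CDP command messages with increasing ids (hypothetical helper)."""

    def __init__(self):
        self._ids = itertools.count(1)

    def command(self, method: str, **params) -> str:
        # CDP commands are JSON objects with "id", "method" and "params".
        return json.dumps({"id": next(self._ids), "method": method, "params": params})


def ui_inspection_commands(url: str) -> list:
    """Messages an agent harness might send to look at one page."""
    builder = CdpCommandBuilder()
    return [
        builder.command("Page.navigate", url=url),
        # computedStyles is a required parameter; an empty list requests none.
        builder.command("DOMSnapshot.captureSnapshot", computedStyles=[]),
        builder.command("Page.captureScreenshot", format="png"),
    ]
```

The point of the sketch is that the UI becomes plain structured data: a DOM snapshot and a screenshot are things an agent can request and read inside its own loop.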
b. Let Codex read observability directly#
- Logs are queryable
- Metrics are queryable
- Traces are queryable
- Each worktree maps to a temporary, isolated observability environment
c. Let Codex self-serve debugging inside a worktree#
- Every change can spin up an isolated instance
- Reproduce the bug
- Verify the fix
- Record before/after comparison videos
I think this point matters a lot, because it shows:
An agent’s true capability ceiling is often determined by the boundaries of what the system makes observable, verifiable, and controllable.
5. Architectural boundaries must be mechanized, not remembered by humans#
The article stresses this repeatedly:
- You can’t rely on “team conventions” alone
- You have to write the boundaries into linters, structural tests, and CI rules
They adopted a very strict layered architecture:
Types -> Config -> Repo -> Service -> Runtime -> UI
Plus a unified Providers entry point to handle cross-cutting concerns, such as:
- auth
- telemetry
- connectors
- feature flags
The key point:
- Let the agent be free in local implementation details
- But have zero tolerance on boundaries and dependency directions
This is a lot like:
Be extremely strict about system structure, and reasonably permissive about the specific way things are expressed.
I think this is exactly the right engineering discipline for the agent-first era.
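Here is a rough sketch of what "mechanized" could mean for the layer order above, written as a structural test over import statements. Several things are my assumptions rather than the article's: the `app.<layer>.*` package layout, the function names, and the reading of the arrows as "a layer may import only from itself or from layers to its left" (so UI may use Service, but Service must not use UI).

```python
import ast
from typing import Optional

# Layer order taken from the note above; lower index means lower layer.
LAYERS = ["types", "config", "repo", "service", "runtime", "ui"]
RANK = {name: index for index, name in enumerate(LAYERS)}


def layer_of(module: str) -> Optional[str]:
    """Map 'app.service.billing' to 'service' (assumes a hypothetical app.<layer>.* layout)."""
    parts = module.split(".")
    if len(parts) >= 2 and parts[0] == "app" and parts[1] in RANK:
        return parts[1]
    return None


def imported_modules(source: str) -> list:
    """Collect absolute module names imported by a piece of source code."""
    modules = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            modules.extend(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            modules.append(node.module)
    return modules


def find_violations(module: str, source: str) -> list:
    """Report imports that reach into a higher layer than the importing module."""
    own = layer_of(module)
    if own is None:
        return []
    violations = []
    for imported in imported_modules(source):
        dep = layer_of(imported)
        if dep is not None and RANK[dep] > RANK[own]:
            violations.append(f"{module} ({own}) must not import {imported} ({dep})")
    return violations
```

Run over every file in CI, a check like this turns the dependency direction into a failing test instead of a convention someone has to remember, while leaving the code inside each layer free.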
6. Lint / rules / “golden principles” are worth even more in the agent era#
When humans drive development, many conventions feel tedious. When agents drive development, the same conventions become multipliers: once a rule is explicit, the agent can follow it continuously and at scale, and enforce it consistently across all PRs.
The article mentions they mechanize a lot of “taste judgments,” such as:
- Naming conventions
- File-size limits
- Structured-logging rules
- Boundary-validation principles
- Avoiding YOLO-style data probing
They sum these up as a more subjective but still enforceable class of “golden principles.”
I strongly agree with this, because what it really amounts to is:
Distilling human taste into system constraints that a machine can enforce continuously.
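As an illustration of taste turned into a machine check, here is a small lint sketch covering three of the items above: naming, file size, and structured logging. The concrete rules are hypothetical (the 400-line limit, snake_case file names, and "no f-strings as log messages" are my picks), since the article does not spell out its exact rules.

```python
import ast
import re

MAX_LINES = 400  # hypothetical file-size limit
SNAKE_CASE = re.compile(r"^[a-z][a-z0-9_]*\.py$")
LOG_METHODS = {"debug", "info", "warning", "error", "critical"}


def lint_source(filename: str, source: str, max_lines: int = MAX_LINES) -> list:
    """Return findings for one file; filename is the bare file name, not a path."""
    findings = []

    # Naming convention: file names are snake_case.
    if not SNAKE_CASE.match(filename):
        findings.append(f"{filename}: file name should be snake_case")

    # File-size limit: large files are hard for humans and agents to navigate.
    line_count = len(source.splitlines())
    if line_count > max_lines:
        findings.append(f"{filename}: {line_count} lines exceeds limit of {max_lines}")

    # Structured logging: flag log calls whose message is built with an f-string.
    for node in ast.walk(ast.parse(source)):
        if (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Attribute)
            and node.func.attr in LOG_METHODS
            and node.args
            and isinstance(node.args[0], ast.JoinedStr)
        ):
            findings.append(
                f"{filename}:{node.lineno}: log message built with an f-string; "
                "pass fields separately"
            )
    return findings
```

Each finding message says what to do instead, which matters more than usual here: the message is effectively the instruction the agent reads when the check fails.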
7. In a high-throughput agent environment, “waiting” is more expensive than “making mistakes”#
The article also has a counterintuitive but very real point:
When agent throughput is extremely high:
- The cost of fixing a mistake after the fact goes down
- The cost of waiting on a human blocker goes up
So they removed a lot of the traditional human-imposed blocking merge gates. Short-lived PRs, fast patches, and continuous refactoring may be more sensible than obsessing over zero defects in a single merge.
This isn’t to say quality doesn’t matter — it’s to say:
In an agent-heavy workflow, quality control should rely more on automated feedback loops than on human gatekeeping that throttles throughput.
8. “AI sludge” is a real problem and demands continuous garbage collection#
I find this framing especially vivid:
- Letting agents write all the code constantly produces pattern drift and “AI sludge”
- If you don’t deal with it, the codebase slowly accumulates entropy
At first they spent every Friday (about 20% of the week) cleaning up by hand, which obviously doesn’t scale. So they systematized this too:
- Run background Codex jobs on a schedule
- Scan for deviations
- Update quality grades
- Automatically open small refactoring PRs
It runs continuously, like garbage collection.
This line of thinking is worth remembering:
Technical debt isn’t something you pay off all at once at the end — you pay it down continuously in small amounts, like servicing a high-interest loan.
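A minimal sketch of the bookkeeping behind such a background job, under my own assumptions: findings arrive as `(path, message)` pairs from some scanner, the grade thresholds are invented, and "small refactoring PRs" is modeled as batches of at most a few files each. None of the names come from the article.

```python
from collections import defaultdict


def grade(violation_count: int) -> str:
    """Map a violation count to a letter grade (hypothetical thresholds)."""
    if violation_count == 0:
        return "A"
    if violation_count <= 3:
        return "B"
    if violation_count <= 10:
        return "C"
    return "D"


def plan_cleanup(findings, max_files_per_pr: int = 3):
    """Turn scanner findings into per-file grades and small cleanup batches.

    findings: iterable of (path, message) pairs.
    Returns (grades, batches); each batch is a list of paths for one small PR.
    """
    by_file = defaultdict(list)
    for path, message in findings:
        by_file[path].append(message)

    grades = {path: grade(len(messages)) for path, messages in by_file.items()}

    # Worst files first, with the path as a tie-breaker for stable output.
    worst_first = sorted(by_file, key=lambda p: (-len(by_file[p]), p))
    batches = [
        worst_first[i:i + max_files_per_pr]
        for i in range(0, len(worst_first), max_files_per_pr)
    ]
    return grades, batches
```

Capping the batch size keeps each cleanup PR small enough to review and revert cheaply, which fits the garbage-collection analogy: frequent, small, and boring.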
Current understanding / conclusions#
My core read on this article is:
It’s not an “AI makes coding faster” promo piece#
What it’s really about is:
- What an agent-first repo should look like
- Where the future engineer’s leverage lies
- Where the discipline of software engineering should now be applied
The most important shift: from code craftsmanship to systems craftsmanship#
The old emphasis was:
- How to hand-write better code
What matters more now is:
- How to design a better environment
- How to make context readable to the agent
- How to automate the feedback loop
- How to turn boundaries and taste into an enforceable system
The three keywords that resonated most with me#
- Map, not manual
  - Give the agent a map, not a verbose manual
- Agent readability
  - Everything must account for whether the agent can reason about and verify it directly
- Harness engineering
  - The engineering focus moves from writing the implementation to building the harness that lets agents work efficiently
Implications for my actual work#
If I translate this article into more concrete advice for my own work, the most useful points are:
1. The repo should become a real knowledge system#
- Key consensus must go into the repo
- Docs must be structured
- AGENTS.md should be short, stable, and map-like
2. Let the agent see the UI and observability directly#
- Screenshots
- DOM
- Logs
- Metrics
- Traces
These aren’t “nice-to-have extras” — they’re core infrastructure for an agent workflow.
3. Write boundaries and taste as code#
- lint
- CI
- structural tests
- golden rules
Whatever can be automated shouldn’t rely on human memory.
4. Keep doing doc gardening and AI-sludge cleanup#
- Docs rot
- Patterns drift
- You have to clean continuously
5. The scarcest human resource becomes attention#
There’s an underlying logic in the article that rings very true:
The scarcest thing in the future isn’t code output — it’s human time, attention, and judgment.
To be added#
Directions worth fleshing out later:
- How to port this workflow to smaller teams / personal projects
- How exec plans should actually be written in an agent workflow
- Automation design for doc gardening / repo linting / quality-drift management
- How to redesign the quality system once PR lifecycles get shorter
- What this pattern means, respectively, for frontend, full-stack, and infrastructure projects
Related links / sources#
- OpenAI’s official article: https://openai.com/zh-Hans-CN/index/harness-engineering/
- Gemini close-reading summary: https://gemini.google.com/share/e447cc0560fa