Joye Personal Blog


🔬 Research ✅ Ready · ai · agent · codex · software engineering · workflow

Harness Engineering and Codex in Production

A practical write-up on how the OpenAI team built a production-grade project with code written almost 100% by Codex.

Updated March 14, 2026

The core idea

This archive card captures an engineering-practice article I think is genuinely important:

The most valuable part isn’t the fact that “Codex can write code.” It’s this:

When a team decides not to hand-write code and instead lets an agent generate nearly all of the implementation, how does the focus of software engineering shift?

The article’s answer is crystal clear:

  • Humans are no longer primarily responsible for writing code
  • Humans are responsible for designing the environment, the boundaries, the feedback loops, and the knowledge systems
  • Agents are responsible for execution
  • The discipline of software engineering moves from “how to write code” to “how to design systems that let agents work reliably”

That’s the one point I think is most worth committing to memory.

Key takeaways

1. The slogan isn’t AI replacing humans; it’s “Humans steer, agents execute”

The single most important line in this article can be boiled down to:

Humans steer, agents execute.

What the OpenAI team ran was an extreme experiment:

  • Start from an empty repository
  • Have nearly all the code generated by Codex
  • Including:
    • Application logic
    • Tests
    • CI configuration
    • Documentation
    • Observability definitions
    • Internal tooling
  • Humans don’t write code directly; they drive output through prompts, review, design constraints, and system building

The article mentions they started with three engineers, later scaled to seven, and over roughly five months shipped a real product approaching 1 million lines of code, opening and merging around 1,500 PRs.

So this article is no longer about “can AI write a demo?” It’s about:

  • A production-grade project
  • Continuous deployment
  • Incidents and fixes
  • Real users
  • An agent-first way of organizing software engineering

2. The real bottleneck is no longer writing code; it’s environment design

One of the article’s strongest observations:

Early progress was slow, not because Codex couldn’t write code, but because:

  • The environment wasn’t well-defined enough
  • The tooling wasn’t complete enough
  • The structure wasn’t clear enough
  • The feedback loops weren’t direct enough

In other words, the problem wasn’t that the model couldn’t do it — it was that the system hadn’t been built to let the model reliably do it.

So the engineer’s new responsibilities become:

  • Make goals decomposable
  • Make context readable
  • Make rules enforceable
  • Make results verifiable

Once a project gets stuck, the way to think about it is no longer “let me manually fill in this feature,” but rather:

What capability is the agent actually missing right now? And can that capability be added to the system explicitly?

3. The repo has to become a knowledge base the agent can actually read

This is the part of the article I agree with most.

For an agent, knowledge that lives outside the repo essentially doesn’t exist:

  • Slack discussions
  • Tacit experience inside someone’s head
  • Agreements buried in Google Docs

If it hasn’t made it into the repo, hasn’t been versioned, and hasn’t been organized in a structured way, it won’t reliably enter the agent’s operating context.

So in the end, their approach wasn’t to maintain one giant AGENTS.md, but rather:

  • Use a short AGENTS.md as a map
  • Put the real knowledge in a structured docs/
  • Make documentation the system of record
  • Continuously clean and update docs via lint / CI / a doc-gardening agent
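One way to keep the map honest is a small CI check that every docs/ path mentioned in AGENTS.md still exists. The sketch below is my own illustration, not something the article specifies: the link pattern, the function name find_broken_links, and the repo layout are all assumptions.

```python
import re
from pathlib import Path

# Assumed convention: AGENTS.md points at docs with relative paths
# such as docs/architecture.md. Adjust the pattern to your own layout.
DOC_LINK = re.compile(r"\bdocs/[\w\-./]+\.md\b")


def find_broken_links(repo_root: Path) -> list[str]:
    """Return docs/ paths referenced in AGENTS.md that are missing on disk."""
    text = (repo_root / "AGENTS.md").read_text(encoding="utf-8")
    referenced = sorted(set(DOC_LINK.findall(text)))
    return [path for path in referenced if not (repo_root / path).is_file()]
```

Run in CI, a non-empty result would fail the build, so a renamed or deleted doc can't silently turn the map into a dead end.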

The philosophy behind this is powerful:

Give the agent a map, not a 1,000-page manual.

I think this is an especially important lesson for any agent-first repo.

4. Design the system for “agent readability”

Traditional software engineering talks about human readability. The article emphasizes another layer:

agent readability

That is:

  • Can the repo let the agent quickly locate knowledge?
  • Can the UI be read and understood by the agent?
  • Can logs / metrics / traces feed directly into the reasoning loop?
  • Can the agent reproduce a bug, verify the fix, and collect feedback on its own?

The OpenAI team did a few crucial things:

a. Let Codex read the UI directly

  • Hook into the Chrome DevTools Protocol
  • See DOM snapshots
  • See screenshots
  • Navigate and verify the UI
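To make the list above concrete: Chrome DevTools Protocol commands are JSON messages with an id, a method, and params, sent over a WebSocket. The sketch below only builds the messages for "navigate, snapshot the DOM, take a screenshot". The article doesn't say which CDP methods the team uses, so the choice of Page.navigate, DOMSnapshot.captureSnapshot, and Page.captureScreenshot is my assumption, as are the helper names.

```python
import itertools
import json

# CDP expects each command to carry a unique integer id.
_ids = itertools.count(1)


def cdp_command(method: str, **params) -> str:
    """Serialize one CDP command as a JSON message."""
    return json.dumps({"id": next(_ids), "method": method, "params": params})


def ui_inspection_commands(url: str) -> list[str]:
    """Messages an agent harness might send to look at a page (assumed flow)."""
    return [
        cdp_command("Page.navigate", url=url),
        # computedStyles is a required parameter; an empty list skips style capture.
        cdp_command("DOMSnapshot.captureSnapshot", computedStyles=[]),
        cdp_command("Page.captureScreenshot", format="png"),
    ]
```

Sending these over the browser's debugging WebSocket and waiting for the page to finish loading are left out; a real harness would need both.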

b. Let Codex read observability directly

  • Logs are queryable
  • Metrics are queryable
  • Traces are queryable
  • Each worktree maps to a temporary, isolated observability environment
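The article doesn't describe how worktrees are mapped to observability environments, so the following is purely a sketch of one possible scheme: derive a stable namespace and a block of ports from the worktree path, so that two worktrees never share log, metric, or trace endpoints by accident. ObservabilityEnv, env_for_worktree, and the port ranges are hypothetical.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class ObservabilityEnv:
    namespace: str
    logs_port: int
    metrics_port: int
    traces_port: int


def env_for_worktree(
    worktree_path: str, base_port: int = 20000, slots: int = 1000
) -> ObservabilityEnv:
    """Map a worktree path to a deterministic namespace and port block."""
    digest = hashlib.sha256(worktree_path.encode("utf-8")).hexdigest()
    # Each slot owns three consecutive ports: logs, metrics, traces.
    slot = int(digest[:8], 16) % slots
    first = base_port + slot * 3
    return ObservabilityEnv(
        namespace=f"wt-{digest[:12]}",
        logs_port=first,
        metrics_port=first + 1,
        traces_port=first + 2,
    )
```

Hashing keeps the mapping stateless, but two worktrees can still land in the same slot, so a real setup would need to probe for free ports or keep a registry.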

c. Let Codex self-serve debugging inside a worktree

  • Every change can spin up an isolated instance
  • Reproduce the bug
  • Verify the fix
  • Record before/after comparison videos

I think this point matters a lot, because it shows:

An agent’s true capability ceiling is often determined by the boundaries of what the system makes observable, verifiable, and controllable.

5. Architectural boundaries must be mechanized, not remembered by humans

The article stresses this repeatedly:

  • You can’t rely on “team conventions” alone
  • You have to write the boundaries into linters, structural tests, and CI rules

They adopted a very strict layered architecture:

  • Types -> Config -> Repo -> Service -> Runtime -> UI

Plus a unified Providers entry point to handle cross-cutting concerns, such as:

  • auth
  • telemetry
  • connectors
  • feature flags

The key point:

  • Let the agent be free in local implementation details
  • But have zero tolerance on boundaries and dependency directions
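A structural test for the dependency direction can be very small. The sketch below assumes the arrow means "later layers may depend on earlier ones" (so UI may import Service, but Repo may not import Service), that the top-level package name identifies the layer, and that anything outside the six layers, such as a providers package, is allowed everywhere. Those are my reading and my naming, not details confirmed by the article.

```python
from typing import Optional

# Layer order from the article; index = how far "up" the stack a layer sits.
LAYERS = ["types", "config", "repo", "service", "runtime", "ui"]
RANK = {name: index for index, name in enumerate(LAYERS)}


def layer_of(module: str) -> Optional[str]:
    """Assumed convention: the top-level package name is the layer."""
    top = module.split(".", 1)[0]
    return top if top in RANK else None


def find_violations(imports: dict[str, list[str]]) -> list[tuple[str, str]]:
    """Return (importer, imported) pairs where a layer reaches upward."""
    violations = []
    for importer, imported_modules in imports.items():
        src = layer_of(importer)
        if src is None:
            continue  # not part of the layered code, e.g. providers
        for target in imported_modules:
            dst = layer_of(target)
            if dst is not None and RANK[dst] > RANK[src]:
                violations.append((importer, target))
    return violations
```

Fed with an import graph extracted from the codebase, a non-empty result fails CI, which is what turns the boundary from a convention into a rule.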

This is a lot like:

Be extremely strict about system structure, and reasonably permissive about the specific way things are expressed.

I think this is exactly the right engineering discipline for the agent-first era.

6. Lint / rules / “golden principles” are worth even more in the agent era

When humans drive development, a lot of conventions can feel a bit tedious. But when agents drive development, those same conventions become multipliers.

Because once a rule is explicit, the agent can follow it continuously and at scale, and it can be enforced consistently across all PRs.

The article mentions they mechanize a lot of “taste judgments,” such as:

  • Naming conventions
  • File-size limits
  • Structured-logging rules
  • Boundary-validation principles
  • Avoiding YOLO-style data probing

They sum these up as a more subjective but still enforceable class of “golden principles.”
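As an illustration of what "mechanized taste" can look like, here is a minimal lint sketch covering two of the items above: a naming convention and a file-size limit. The specific rules (snake_case .py file names, a 400-line cap) and the name lint_file are my assumptions; the article doesn't give its actual thresholds.

```python
import re
from pathlib import Path

MAX_LINES = 400  # hypothetical limit; pick whatever your team agrees on
SNAKE_CASE = re.compile(r"^[a-z][a-z0-9_]*\.py$")


def lint_file(path: Path, max_lines: int = MAX_LINES) -> list[str]:
    """Return human- and agent-readable problems for one source file."""
    problems = []
    if not SNAKE_CASE.match(path.name):
        problems.append(f"{path}: file name is not snake_case")
    line_count = len(path.read_text(encoding="utf-8").splitlines())
    if line_count > max_lines:
        problems.append(
            f"{path}: {line_count} lines exceeds limit of {max_lines}; "
            "split the module"
        )
    return problems
```

The message text matters as much as the check, since the agent reads it as its next instruction.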

I strongly agree with this, because what it really amounts to is:

Distilling human taste into system constraints that a machine can enforce continuously.

7. In a high-throughput agent environment, “waiting” is more expensive than “making mistakes”

The article also has a counterintuitive but very real point:

When agent throughput is extremely high:

  • The cost of fixing a mistake after the fact goes down
  • The cost of waiting on a human blocker goes up

So they removed a lot of the traditional human-imposed blocking merge gates. Short-lived PRs, fast patches, and continuous refactoring may be more sensible than obsessing over zero defects in a single merge.

This isn’t to say quality doesn’t matter — it’s to say:

In an agent-heavy workflow, quality control should rely more on automated feedback loops than on human gatekeeping that throttles throughput.

8. “AI sludge” is a real problem and demands continuous garbage collection

I find this framing especially vivid:

  • Letting agents write all the code constantly produces pattern drift and “AI sludge”
  • If you don’t deal with it, the codebase slowly accumulates entropy

At first they spent every Friday (about 20% of the week) cleaning up by hand, which obviously doesn’t scale. So they systematized this too:

  • Run background Codex jobs on a schedule
  • Scan for deviations
  • Update quality grades
  • Automatically open small refactoring PRs

It runs continuously, like garbage collection.
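The scheduled cleanup loop can be pictured as: take the findings from a scan, grade each module, and cut the findings into small batches, each of which becomes one refactoring PR. The sketch below is my own simplification; the grade thresholds, the (module, rule) finding shape, and the names grade and plan_cleanup are assumptions, and the scanning and PR-opening steps are left out.

```python
from collections import Counter


def grade(violations: int) -> str:
    """Hypothetical grading scale based on violation count."""
    if violations == 0:
        return "A"
    if violations <= 3:
        return "B"
    if violations <= 10:
        return "C"
    return "D"


def plan_cleanup(findings: list[tuple[str, str]], batch_size: int = 5):
    """Grade modules and split findings into small, PR-sized batches.

    findings: (module, rule) pairs produced by an earlier scan.
    Returns (grades, batches), with the worst modules scheduled first.
    """
    per_module = Counter(module for module, _ in findings)
    grades = {module: grade(count) for module, count in per_module.items()}
    ordered = sorted(findings, key=lambda f: (-per_module[f[0]], f[0], f[1]))
    batches = [
        ordered[i : i + batch_size] for i in range(0, len(ordered), batch_size)
    ]
    return grades, batches
```

Keeping batches small is the point: each PR stays cheap to review and cheap to revert, which is what lets the job run continuously.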

This line of thinking is worth remembering:

Technical debt isn’t something you pay off all at once at the end — you pay it down continuously in small amounts, like servicing a high-interest loan.

Current understanding / conclusions

My core read on this article is:

It’s not an “AI makes coding faster” promo piece

What it’s really about is:

  • What an agent-first repo should look like
  • Where the future engineer’s leverage lies
  • Where the discipline of software engineering should focus

The most important shift: from code craftsmanship to systems craftsmanship

The old emphasis was:

  • How to hand-write better code

What matters more now is:

  • How to design a better environment
  • How to make context readable to the agent
  • How to automate the feedback loop
  • How to turn boundaries and taste into an enforceable system

The three keywords that resonated most with me

  1. Map, not manual
    • Give the agent a map, not a verbose manual
  2. Agent readability
    • Every part of the system should be designed so the agent can reason about it and verify it directly
  3. Harness engineering
    • The engineering focus moves from writing the implementation to building the harness that lets agents work efficiently

Implications for my actual work

If I translate this article into more concrete advice for my own work, the most useful points are:

1. The repo should become a real knowledge system

  • Key consensus must go into the repo
  • Docs must be structured
  • AGENTS.md should be short, stable, and map-like

2. Let the agent see the UI and observability directly

  • Screenshots
  • DOM
  • Logs
  • Metrics
  • Traces

These aren’t “nice-to-have extras” — they’re core infrastructure for an agent workflow.

3. Write boundaries and taste as code

  • lint
  • CI
  • structural tests
  • golden rules

Whatever can be automated shouldn’t rely on human memory.

4. Keep doing doc gardening and AI-sludge cleanup

  • Docs rot
  • Patterns drift
  • You have to clean continuously

5. The scarcest human resource becomes attention

There’s an underlying logic in the article that rings very true:

The scarcest thing in the future isn’t code output — it’s human time, attention, and judgment.

To be added

Directions worth fleshing out later:

  1. How to port this workflow to smaller teams / personal projects
  2. How exec plans should actually be written in an agent workflow
  3. Automation design for doc gardening / repo linting / quality-drift management
  4. How to redesign the quality system once PR lifecycles get shorter
  5. What this pattern means, respectively, for frontend, full-stack, and infrastructure projects

🗂️ A research note from the knowledge base.
