
On the one-year anniversary of coining “vibe coding,” Andrej Karpathy proposed replacing it with “agentic engineering.” The distinction he drew was precise: vibe coding is describing what you want and accepting what comes back. Agentic engineering is designing the system, specifying the constraints, and using AI to accelerate implementation you have already reasoned through. One is expression. The other is engineering.
Most software organizations are running both simultaneously and calling them the same thing. That is where the expensive mistakes are coming from.
One of my development leads put it plainly — not as a policy position, but as an empirical observation. In his experience, vibe-coded PRs consistently arrive missing edge case handling, error paths, and exception logic. Not because the AI forgot them. Because the developer never specified them. They described an outcome, accepted what the agent produced because it looked right, and submitted it. The tests pass because they were written against the code that exists, not against the behavior the system actually requires.
The agent did not make something up. The developer did not know what to ask for.
His response is not to reject AI coding tools. It is to require that engineers demonstrate they understand what was generated — the edge cases, the scaling assumptions, the failure modes — before the PR gets merged. If you cannot explain why the solution is designed the way it is, you did not design it. You accepted it.
He is right. And the data backs him up. PR review times on heavily AI-assisted teams are up 91% — not because AI is writing worse code, but because reviewers are now responsible for reconstructing the comprehension that the developer skipped. That is a harder review, not an easier one. And it is compounding.
What AI Did to the Roles — and What It Didn’t
There is a widespread assumption among technology leaders that AI coding tools collapsed the distinction between who builds and who reviews — that the agent writes well enough that the old quality gates are a legacy of a slower era.
That assumption confuses velocity with comprehension.
The developer, the tester, the architect — these roles were never primarily about producing artifacts. They were about understanding the system well enough to know when something was wrong before it became someone else’s problem. The developer who spots a race condition does so because they understand the execution model. The tester who asks “what happens when the user does the unexpected thing?” asks it because they have reasoned through the system’s behavior. The architect who recognizes that a solution works now but will break at scale does so because they hold the whole system in their head.
These are not production tasks. They are comprehension tasks. You cannot delegate comprehension to an agent.
What changed is that you can now produce a hundred lines of code without having done the thinking that a hundred lines of code used to require. The output exists. The understanding behind it may not. An engineer reviewing a vibe-coded PR is not reviewing code — they are trying to reconstruct whether the developer who submitted it actually understood what they were building.
The roles are not dissolving. They are being stress-tested. The developer who designed the solution — who can explain every edge case, every failure mode, every scaling assumption — is more valuable than before. The one who accepted what the agent produced because it looked right and the tests passed is now a liability at the speed the organization is moving.
Three Failure Modes Engineering Managers Need to Watch For
These are not hypotheticals. They are patterns repeating across organizations deploying AI coding tools at scale.
The green pipeline problem. A green pipeline means the code does what it was asked to do. It does not mean the developer asked the right thing, or asked completely enough. A senior engineer knows to look behind the green. A manager who has stepped too far from the work cannot tell from a dashboard whether green means safe or means fast and unexamined.
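What that looks like in practice: the test suite asserts what the code does, not what the requirement says. A minimal sketch, with hypothetical names; the function, its silent default, and the test are all invented for illustration:

```python
# Hypothetical agent-produced parser: it silently maps bad input to a
# default instead of rejecting it.
def parse_discount(value: str) -> float:
    try:
        return float(value.rstrip("%")) / 100
    except ValueError:
        return 0.0  # the bet: malformed input "never happens"

# The test that keeps the pipeline green. It was written against the
# code that exists, so it encodes the bug as intended behavior.
def test_parse_discount():
    assert parse_discount("15%") == 0.15
    assert parse_discount("garbage") == 0.0  # should this raise instead?
```

The pipeline is green because the question “what should happen on bad input?” was never asked, not because it was answered.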
The missing path problem. The developer who does not understand the system’s failure modes cannot specify them. The agent cannot surface what the developer did not know to require. In a production system, the happy path is where things work. The unhappy paths are where you find out what the system is actually made of. AI agents, as Karpathy noted, were purpose-built for the first 80% of an application — the implementation that flows naturally from a well-described intent. The last 20% — the edge cases, the failure recovery, the scaling constraints — requires a developer who has actually thought through the system. That 20% is where vibe-coded work consistently runs out.
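A sketch of that split in a single function; the endpoint and response shape are hypothetical:

```python
import json
import urllib.request

# The first 80%: the happy path that flows from a well-described intent.
def fetch_profile(user_id: str) -> dict:
    url = f"https://api.example.com/users/{user_id}"
    with urllib.request.urlopen(url) as resp:
        return json.loads(resp.read())

# The last 20% exists only if the developer specifies it:
# - no timeout: urlopen can block indefinitely unless timeout= is passed
# - HTTP errors: urlopen raises HTTPError on 4xx/5xx; nothing here catches it
# - malformed body: json.loads raises on bad JSON; no fallback is defined
# - no retry or backoff for transient network failure
```

None of those four lines of commentary will appear in a vibe-coded PR, because none of them were in the prompt.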
The confidence calibration problem. AI-generated code reads as authoritative. The structure is clean, the naming is coherent, the comments are present. It does not look like code written by someone who was uncertain — even when the underlying logic contains a bet that something will never happen. Human code carries the fingerprints of doubt: the comment that says “TODO: handle this case,” the defensive check that signals the developer was not sure. AI code often lacks those signals. Reviewers have to supply the doubt themselves. That requires judgment the reviewer can only exercise if they understand the system well enough to know what to doubt.
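A hypothetical side-by-side; both functions are invented for illustration:

```python
# Human code tends to carry its doubt on the surface:
def allocate_shard(user_id: int, shard_count: int) -> int:
    # TODO: can shard_count be 0 mid-resize? Guarding defensively for now.
    if shard_count <= 0:
        raise ValueError("shard_count must be positive")
    return user_id % shard_count

# Generated code often reads clean and certain, and carries the same
# bet silently:
def allocate_shard_generated(user_id: int, shard_count: int) -> int:
    """Assign a user to a shard using modulo distribution."""
    return user_id % shard_count  # ZeroDivisionError waiting for a resize window
```

Both functions implement the same logic. Only one tells the reviewer where to look.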
What Engineering Leaders Need to Do Differently
There is a version of technical leadership that sounds sophisticated and is quietly dangerous in this environment: the manager who has stepped back from the code to focus on delivery metrics, who measures the AI program by velocity numbers and adoption rates, and who interprets a senior engineer’s insistence on deep code review as resistance to change.
That manager is optimizing for the output of the process rather than the quality of the judgment being applied to it. In a fast-moving AI environment, that is a compounding error.
Technical proximity is not micromanagement. It is not writing code or reviewing every PR. It is being close enough to the actual behavior of the systems you are accountable for that you can tell the difference between a team moving fast because they are disciplined and a team moving fast because they skipped the hard part.
The manager who cannot read a PR does not need to review every one. But they need to understand what their senior engineers look for when they do. That distinction — between “this passed the tests” and “this is right” — is not available from a summary. It is available from contact.
My team runs three rituals that have nothing to do with status updates and everything to do with maintaining that contact.
Two hours every week in an architecture working session. Two hours every other week in sprint planning. Two hours each sprint demoing to the whole team.
The architecture sessions are where the system’s reasoning lives — not the tickets, not the documentation, but the living conversation about why things are designed the way they are and which options were considered and rejected. A manager who sits in those sessions for six months builds a working model of the system that no dashboard can replicate.
Sprint planning is where the disconnects surface. We use planning poker — everyone estimates independently before the reveal. When estimates diverge sharply, the conversation that follows is almost always the most valuable one of the sprint. Not because we are negotiating a number. Because divergent estimates mean divergent mental models. Someone thinks this task is a 2. Someone else thinks it is a 13. That gap is not a disagreement about effort. It is evidence that two people are not looking at the same problem.
Divergent estimates don’t measure complexity. They measure where your team’s understanding of the system breaks down.
The demos keep everyone honest about what was actually built versus what was intended, cross-train the team across what each person is working on, and give the manager the most important signal of all: whether the people building the system can explain what they built and why the tradeoffs they made were right.
An AI agent can produce a demo. It cannot explain its reasoning under questioning. The engineers who can are the ones you cannot afford to route around.
Karpathy’s reframe from vibe coding to agentic engineering is not a terminology update. It is a professional obligation.
The organizations that ignore AI will fall behind. The ones that vibe it will ship failure at scale. The ones that engineer it — deliberately, with comprehension at every layer — are the ones building something worth running in production.
That is not a productivity conversation. That is a responsible AI conversation. The code looks finished. The pipeline is green. The PR is open.
Whether it is actually ready is still a human call. Make sure your team — and you — are close enough to the work to make it.
