Vibe Coding's Reality Check
The tools are real, the numbers don't lie — but the missing ingredient was never a better model.
The developers who felt 20% faster with AI tools were actually 19% slower. That gap — between perceived speed and actual throughput — is the most important number in the vibe coding story.
It doesn’t invalidate the tools. It doesn’t vindicate the skeptics. It points directly at the thing the industry has been reluctant to name: the bottleneck was never code generation. It was judgment.
Fourteen months ago, Andrej Karpathy gave this moment a name. The tools he was describing are now multi-billion-dollar businesses. The practices he described — “Accept All,” “I don’t read the diffs anymore” — are showing up in production incidents, security disclosures, and a category of engineering dysfunction that has acquired its own term: cognitive debt. By February 2026, Karpathy himself had moved on, proposing “agentic engineering” as the replacement framing.
What follows is a map of what happened, what actually works, what the numbers say, and what the missing ingredient is. It isn’t a better model.
The Moment It Had a Name
On February 2, 2025, Karpathy tweeted: “There’s a new kind of coding I call ‘vibe coding’, where you fully give in to the vibes, embrace exponentials, and forget that the code even exists. It’s possible because the LLMs are getting too good.”
The post got 4.5 million views. It wasn’t a product launch or a benchmark claim. It was a confession. “I ‘Accept All’ always,” he continued. “I don’t read the diffs anymore. When I get error messages I just copy paste them in with no comment, usually that fixes it.”
Something about that honesty landed — because it described what a lot of people were already doing, and nobody had said it out loud yet.
The term did something that “AI-assisted coding” and “copilot” never managed: it captured the psychological shift. You weren’t augmenting your programming. You were handing off the programming and keeping the intent. The human role became specifier, reviewer, and ship-or-don’t-ship decision-maker. The machine became the implementer.
For throwaway prototypes and weekend demos, this was genuinely fine. The code didn’t need to survive the week. If it ran, it worked. “Accept All” was a rational posture under those conditions. The problem was what happened when people carried that posture into production.
“Vibe coding” became a Rorschach test. To non-technical founders it represented freedom from the gatekeeping of engineering knowledge. To experienced engineers it represented the risk of shipping systems nobody could reason about. To the industry it represented a market opportunity large enough to justify nine-, ten-, and eleven-figure valuations. All three readings were correct at once, which is why the conversation has been so hard to navigate.
Collins Dictionary named it Word of the Year for 2025. By February 2026, the word’s coiner had moved past it. His retrospective was honest about the original context: “At the time, LLM capability was low enough that you’d mostly use vibe coding for fun throwaway projects, demos and explorations. Today, programming via LLM agents is increasingly becoming a default workflow for professionals, except with more oversight and scrutiny.”
The word “oversight” is doing a lot of work there.
Fourteen Months of Hype in Numbers
The market growth has been extraordinary by any measure, and the numbers deserve to be stated plainly rather than buried in qualifications.
Cursor — the AI-native VS Code fork that became the standard for professional developers who want maximum productivity without platform dependency — went from a $9.9 billion valuation in June 2025 to $29.3 billion by November, after raising $2.3 billion in a Series D and crossing $1 billion in ARR. Enterprise revenue grew 100x within 2025. As of March 2026, Cursor is in talks to raise at a $50 billion valuation with ARR exceeding $2 billion. It did not exist as a commercial product three years ago.
Lovable went from zero to $100 million ARR in its first eight months. It reached $200 million by November 2025, $300 million in January 2026, and $400 million in February 2026 — $300 million of new annual revenue in seven months, with 146 employees and 8 million users generating 200,000 new projects every day. Its December 2025 Series B valued it at $6.6 billion.
Replit hit $253 million ARR by October 2025, up from roughly $16 million at the start of the year — a 16x increase in twelve months — and raised $400 million at a $9 billion valuation. It’s targeting $1 billion ARR by end of 2026. Forty million users, 75% of whom never write code themselves.
GitHub Copilot crossed 20 million all-time users by July 2025. By January 2026 it had 4.7 million paid subscribers, 75% year-over-year growth, and 90% of Fortune 100 companies using it in some form. Forty-six percent of all code written by active Copilot users is now AI-generated, up from 27% at launch.
Stack Overflow’s 2025 developer survey found 84% of developers using or planning to use AI tools. Google’s DORA 2025 report, drawing on nearly 5,000 technology professionals, put AI adoption in software development at 90%, with professionals spending a median of two hours daily working with AI. YC Winter 2025 reported 25% of startups had AI generating 95% or more of their codebase. The AI coding tools market reached approximately $7.88 billion in 2025. Gartner forecasts 60% of all new code will be AI-generated by the end of 2026.
None of this is hype in the pejorative sense. These are real numbers from real businesses with real customers making real purchasing decisions. What they don’t tell you is what those customers are actually building, how long it survives in production, or who fixes it when it breaks.
A Map of the Ecosystem
The biggest conceptual error in most vibe coding coverage is treating these tools as a single category. They are three distinct tiers with different jobs, different user profiles, and different risk surfaces. Conflating them produces bad decisions about adoption, staffing, and expectations.
Tier 1: IDE-Level Assistants are tools for engineers who already know what they’re doing. The AI augments their capability; it doesn’t replace their judgment. Your code lives in your repository. The AI is a collaborator, not a platform.
Cursor is the standard-bearer — a VS Code fork with deep codebase awareness that can reference your entire repository, propose multi-file edits, and run background agent tasks while you review. GitHub Copilot is the most widely adopted in enterprise settings, largely because the GitHub relationship simplifies procurement. Windsurf (formerly Codeium, acquired by Cognition in July 2025) is a strong Cursor competitor, though the acquisition introduces product direction questions still unresolved. Claude Code is the outlier — terminal-first, CLI-based, designed for engineers comfortable thinking in architecture and working across large complex codebases. It consistently performs best on complex reasoning tasks. The tradeoff is the highest floor for effective use: you need to know what you’re asking for, and you need to be able to evaluate what comes back.
Tier 2: Autonomous Agents handle entire tasks, not just lines or functions. You give them an issue description or a spec; they plan the approach, write code, run tests, fix failures, and open a pull request.
Devin (Cognition AI) defines this category. After a rocky initial launch, the 2025 and 2026 versions tell a more grounded story: 67% of its pull requests are now merged, up from 34% a year ago. It’s running in production at Goldman Sachs, Nubank, Citi, Dell, Cisco, and Palantir. A March 2026 update gave Devin the ability to orchestrate parallel Devins — planner, coder, and reviewer agents working simultaneously. Multi-agent coordination is now a production feature, not a roadmap item. Claude Code in its agentic configuration and GitHub Copilot Agent Mode are both moving into this tier for long-horizon tasks.
The critical framing for engineering managers: Tier 2 tools are not junior developers. They’re more like very fast contractors who perform well on well-scoped tasks and make confident, difficult-to-detect mistakes when the specification is ambiguous.
Tier 3: SaaS App Builders are where “AI coding” might be a misnomer — they’re closer to “AI product building.” You describe what you want and receive a working application. No IDE required, no prior coding knowledge assumed.
Lovable leads this category — chat-first, full-stack, with GitHub sync, Supabase integration, and Stripe support. Bolt.new is browser-based, generates real React code, and is purpose-built for prototypes and hackathon projects. Replit is the most complete platform in this tier — code, database, authentication, and hosting under one roof, with 75% of its users never writing code. v0 (Vercel) handles frontend only, producing clean portable Next.js code with a roughly 90% success rate on well-specified UI tasks. It’s the lowest lock-in option in this tier, by deliberate design.
The lock-in gradient matters in every evaluation: Claude Code and Cursor (code lives in your repo, zero platform dependency) → v0 (clean portable Next.js exports) → Bolt (real exportable React code) → Lovable (GitHub sync helps, runtime coupling is real) → Replit (technically exportable, high friction in practice). More magic equals more vendor dependency — a line that should be visible in every adoption conversation.

Where It Actually Works
The success cases are real and worth understanding carefully, because they share a pattern.
Anything.com shipped a habit tracker, a CPR training app, and a hairstyle try-on tool — all to the App Store, all generating revenue — with no developer on payroll. $2 million ARR in its first two weeks. Theanna, a non-technical solo founder, reached $203,000 ARR using Claude Code at $216 per month in tool costs. Her own framing of what she’d learned: “Vibe coding makes you faster. It doesn’t make you smarter about what to build.”
The enterprise numbers are equally concrete. Booking.com saw up to a 30% increase in merge requests when developers learned better prompting practices. TELUS saved over 500,000 hours across its developer organization. Zapier reached 89% AI adoption internally. A DX platform analysis of 135,000 developers in Q4 2025 found that daily AI tool users merge roughly 60% more pull requests than non-users and save an average of 3.6 hours per week.
The Nubank deployment makes the underlying pattern explicit. Devin was used for large-scale monolith refactoring in January 2026, achieving 8x engineering efficiency and 20x cost savings. This wasn’t “Accept All” in the original sense — it was a senior engineering team with a precise specification delegating well-understood, repetitive work to an autonomous agent and reviewing the results. Goldman Sachs is running the same model at a different scale: Devin piloted alongside 12,000 developers, targeting 20% efficiency gains — effectively getting 14,400 developers’ output from 12,000 headcount.
The pattern across every success case is consistent: the wins happen when a human with clear requirements and the ability to evaluate output uses AI to eliminate the implementation burden. The tool is fast; the human is the quality gate. The failure modes happen when that quality gate is absent.
The Reality Check
Three bodies of evidence from 2025 and early 2026 complicate the productivity narrative in ways the industry hasn’t fully absorbed.
In July 2025, METR published the most rigorous independent study on AI coding productivity to date. Sixteen experienced developers — each with an average of five years and 1,500 commits on their repositories, large open-source projects averaging 22,000 GitHub stars and over one million lines of code — used Cursor Pro with Claude 3.5 and 3.7 Sonnet across 246 tasks. The result: developers took 19% longer with AI tools than without. Before the study, those same developers expected to be 24% faster. Economists studying AI predicted a 39% speedup. ML researchers predicted 38%. The actual outcome was a slowdown.
More revealing: after completing the tasks, developers still believed they had gone 20% faster — despite the measured result. Seventy percent continued using Cursor after the experiment anyway. The value they perceived was real; the speedup they attributed to it wasn’t.

“When people report that AI has accelerated their work,” the researchers noted, “they might be wrong.”
The mechanism is straightforward. The generation phase is fast and satisfying — you get code immediately. The validation phase is slow, cognitively demanding, and often invisible in retrospect: prompting precisely, reviewing output you didn’t write, correcting “almost right” code that looks plausible but fails in ways a fresh read doesn’t catch. On complex mature repositories, these costs compound. A February 2026 METR follow-up acknowledged that early-2026 models are likely more productive than those studied. The 19% figure is a calibration, not a verdict — but it’s an important one against the assumption that faster generation automatically means faster delivery.
The code quality data tells a parallel story. In December 2025, CodeRabbit analyzed 470 open-source GitHub pull requests and found that AI-assisted code produced 1.7 times more major issues than human-written code. Misconfigurations were 75% more common. XSS vulnerabilities specifically appeared at 2.74 times the rate of human-written code. A broader analysis of 5,600 vibe-coded applications found over 2,000 vulnerabilities, 400 exposed secrets, and 175 instances of exposed PII. Fifty-three percent of teams that shipped AI-generated code later discovered security issues that had passed initial review. Georgetown CSET research found 86% of AI-generated code failed XSS defense mechanisms. These are systemic patterns, not outliers from careless developers.
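The XSS failure those numbers describe is concrete: generated UI code interpolates user input straight into markup. A minimal sketch of the missing defense, in TypeScript (the function names here are illustrative, not any framework's API):

```typescript
// Minimal HTML escaping: the defense most often absent in the
// AI-generated code the CSET study examined. Escapes the five
// characters that let attacker-controlled text break out of an
// HTML context.
function escapeHtml(untrusted: string): string {
  const map: Record<string, string> = {
    "&": "&amp;",
    "<": "&lt;",
    ">": "&gt;",
    '"': "&quot;",
    "'": "&#39;",
  };
  return untrusted.replace(/[&<>"']/g, (ch) => map[ch]);
}

// The vulnerable shape generators tend to produce is direct
// interpolation, e.g. element.innerHTML = `<p>${name}</p>`.
// Escaping first (or avoiding innerHTML entirely) closes it.
function renderComment(name: string): string {
  return `<p>${escapeHtml(name)}</p>`;
}
```

In real React or Next.js output the framework escapes JSX text content on its own; the reported vulnerabilities cluster where generated code bypasses that, via `innerHTML`, `dangerouslySetInnerHTML`, or string-built HTML responses.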
Google’s DORA 2025 report found 90% AI adoption across nearly 5,000 technology professionals. The headline finding: AI adoption has a positive relationship with software delivery throughput — teams ship more, faster. But it continues to have a negative relationship with software delivery stability. More output, more incidents. The DORA framing is worth quoting directly: “AI doesn’t fix a team. It amplifies what’s already there.”
The production incidents make the abstract concrete. The Lovable credential leak in May 2025 was predictable in retrospect: 170 of 1,645 analyzed applications had hardcoded database credentials in client-side code, accessible to anyone. The LLM generating that code had no model of what “production” means, no concept of what data should never appear in a client bundle, and no security reviewer in the loop. Attackers found the credentials. In February 2026, Wiz found the Moltbook platform — entirely vibe-coded, by the founder’s own public account — with its Supabase database left open to public read and write access. In December 2025, a structured assessment across 15 applications built by five AI coding tools identified 69 total vulnerabilities, approximately six rated critical.
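The hardcoded-credential failure mode is mechanically detectable before anything ships. As a hedged sketch (these patterns are illustrative and far from exhaustive; real scanners such as gitleaks or trufflehog maintain large curated rule sets), a pre-deploy check over a client bundle might look like:

```typescript
// Illustrative secret patterns; real scanners carry far larger rule sets.
const SECRET_PATTERNS: Array<[string, RegExp]> = [
  ["AWS access key ID", /AKIA[0-9A-Z]{16}/],
  ["hardcoded key assignment", /(api[_-]?key|secret|password)\s*[:=]\s*["'][^"']{8,}["']/i],
  ["JWT-shaped token", /eyJ[\w-]{10,}\.[\w-]{10,}\.[\w-]{10,}/],
];

// Returns the names of any patterns present in a source string.
// A client-side bundle should come back empty before it deploys.
function findSecrets(source: string): string[] {
  return SECRET_PATTERNS.filter(([, re]) => re.test(source)).map(([name]) => name);
}
```

A check like this catches a Lovable-style leak only after the credential has already been generated into the bundle; the structural fix is keeping secrets server-side so they never enter client code at all.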
None of this is an argument against using the tools. It’s an argument for understanding what they do and don’t provide — and for building the human layer that bridges the gap.
The Judgment Gap
Here is the core argument, stated plainly.
The productivity gains are real. The risks are real. Both things are true simultaneously. What determines which one dominates is something the tools cannot provide: engineering judgment.
Judgment is not intuition. It’s not experience measured as accumulated time. It’s the ability to evaluate what’s in front of you against what production-ready looks like — to understand the tradeoffs being made, recognize when a solution is technically wrong even if it compiles, and know which constraints matter in context. It’s the skill of knowing what to reject.
Karpathy’s evolution from February 2025 to February 2026 is the clearest illustration. In February 2025: “Accept All, don’t read the diffs.” Rational framing for throwaway projects — he was right for that context. In February 2026: orchestrating agents, acting as oversight, applying engineering expertise to the judgment layer rather than the implementation layer. The skills didn’t disappear. They got redirected.
His framing for “agentic engineering” was deliberate: “‘agentic’ because the new default is that you are not writing the code directly 99% of the time — you are orchestrating agents who do and acting as oversight. ‘Engineering’ to emphasize there is an art and science and expertise to it.” Oversight is not rubber-stamping. Oversight means you understand what the agent is trying to do, you can evaluate whether it’s doing it correctly, and you can catch the class of errors the agent consistently makes. Remove the person with that knowledge from the loop and you don’t have oversight; you have an unreviewed deployment pipeline.
The METR study’s most interesting finding isn’t the 19% slowdown — it’s the persistent perception gap. Experienced developers on complex codebases, using sophisticated tools, genuinely could not accurately assess whether those tools were helping them. Their intuition said yes. The clock said no. Experienced intuition — the ability to read a diff in two seconds and know whether it’s right — was being traded for generated output that required careful reading to evaluate. The time cost was the same or higher; it just felt different. If the metric you’re optimizing for is “feeling productive,” AI tools perform excellently. If the metric is “time to shipped, working, maintainable code,” the picture is more complicated.
The security failures in Tier 3 tools follow from the same dynamic. Those platforms are designed to minimize friction between intent and execution. For non-engineers with no baseline for what a production-ready authentication flow looks like, that friction was doing useful work. Remove it and insecure code ships faster, with more confidence.
What this means practically: engineers who understand their systems, can read and evaluate AI output critically, and know what verification actually requires become dramatically more productive. Engineers who treat AI output as correct by default, skip verification because the code “looks right,” and have no mental model of what they’re shipping become faster at creating problems. The judgment gap is not about ability. It’s about whether the skills that matter have been built. They can be. But they don’t come from the tools.
The skills involved don’t disappear in the agentic era — they get redirected. Syntax knowledge matters less. Architectural thinking, constraint-setting, and verification skill matter more. The ability to read code you didn’t write, at speed, with enough confidence to catch what’s wrong — that’s the job description now. It is not automatable. It requires exactly the kind of engineering experience that the “you don’t need to code anymore” narrative treats as optional.
The Vendor Lock-In Problem
Most coverage of Tier 3 tools focuses on what they can build. Substantially less attention goes to what happens after you’ve built on them.
The SaaS app builders are platforms. Your application runs inside their infrastructure, often with their database, their authentication layer, their hosting environment. This coupling is the source of their ease. It is also a liability that compounds with scale and time.
The lock-in gradient is not theoretical. Claude Code and Cursor sit at the low end: your code lives in your repository, the AI is a collaborator with zero infrastructure dependency. v0 produces clean portable Next.js that deploys anywhere, by deliberate design. Bolt generates real exportable React code. Lovable has GitHub sync, which genuinely matters for portability, but “the code is in GitHub” and “the team can maintain this independently” are different things — the runtime coupling between generation and hosting is real. Replit is technically exportable but in practice migrating a mature Replit application is a non-trivial project that few teams have done cleanly. More magic means more vendor dependency.
The Lovable credential leak and the Moltbook Supabase exposure weren’t unrelated to this problem. When generation and hosting are tightly coupled, the security properties of generated code depend on the platform’s security model. When that model fails, there’s no separation between the generation failure and the deployment failure. They’re the same system.
The question every team should ask before adopting any Tier 3 tool for anything beyond rapid prototyping: can we run this application without this platform? Not “can we export the code” — can we actually maintain, deploy, and extend it independently? If the answer is no, you’ve traded development velocity for infrastructure dependency. That’s not always wrong. It should be a conscious decision, made with the dependency named explicitly before you ship to production.
No Tier 3 tool currently delivers production-ready applications without manual refinement. The demo works. The gap between the demo and the production system is where the unasked questions live.
Where This Goes
The trajectory from 2025 to 2026 was: individual AI assistants → multi-file, multi-step agents → orchestrated multi-agent systems. Devin’s March 2026 update — parallel Devins operating as planner-coder-reviewer triads — is an early version of what comes next: engineering tasks decomposed across specialized agents in parallel, with humans providing constraints, reviewing outputs, and resolving conflicts. This is not a research prototype. It is the current production frontier.
Context engineering is emerging as a distinct discipline alongside this shift — the practice of managing what the AI knows when: which files, which conventions, which constraints, which architectural decisions have already been made. Getting this right is the difference between an agent that makes coherent architectural decisions and one that generates technically correct code that doesn’t fit the system it’s being added to. Engineers who develop this skill first will have a meaningful advantage over those who treat it as an implementation detail.
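One mechanical slice of that discipline can be made concrete: deciding which context survives when everything cannot fit in the model's window. A simplified sketch (the item shape, the priorities, and the four-characters-per-token estimate are all assumptions for illustration):

```typescript
// A context item a coding agent might be fed: a file, a conventions
// doc, a prior architectural decision. Lower priority = more essential.
interface ContextItem {
  name: string;
  content: string;
  priority: number;
}

// Crude token estimate; real systems use the model's actual tokenizer.
const estimateTokens = (text: string): number => Math.ceil(text.length / 4);

// Greedy packing: take the most essential items first, skip anything
// that would overflow the budget. Production systems also summarize
// or chunk instead of dropping items outright.
function packContext(items: ContextItem[], budgetTokens: number): string[] {
  const chosen: string[] = [];
  let used = 0;
  for (const item of [...items].sort((a, b) => a.priority - b.priority)) {
    const cost = estimateTokens(item.content);
    if (used + cost <= budgetTokens) {
      chosen.push(item.name);
      used += cost;
    }
  }
  return chosen;
}
```

The judgment layer lives in the priorities: deciding that the architecture decision record outranks the full commit history is exactly the kind of call that does not automate away.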
The scale indicators point in one direction. DORA 2025 found 90% AI adoption with median usage of two hours per day. The shift from “do I use AI?” to “how do I use AI effectively?” is already complete for most teams. Gartner’s forecast that 60% of new code will be AI-generated by the end of 2026 may be slightly aggressive on timing, but the direction isn’t in question. McKinsey estimates AI-centric organizations are achieving 20–40% operating cost reduction — but those are organizations that built the human layer before scaling the AI layer.
Security is going to get worse before it gets better. The pace of code generation is outrunning the pace of security practice adoption. Forrester predicts 40% of businesses will use AI for automatic security remediation by 2026. That also means 60% won’t, even as AI-generated code scales into their systems at speed. The tooling is maturing faster than most organizations’ ability to use it well. That gap is where the next two years of incidents will come from — not from the tools failing, but from organizations treating tool adoption as a substitute for building the judgment infrastructure around them.
My Take
The tools are real. The numbers don’t lie. Developers using AI-assisted workflows are shipping faster — meaningfully faster, not rounding-error faster. I’ve seen it in my own work and I’ve watched it happen across teams I respect. That’s not the argument I’m here to have. What’s missing from most of the celebration is the institutional playbook — the hard, unglamorous thinking about what happens after the demo, after the prototype goes to production, after the engineer who built the whole thing with Lovable in a weekend gets a better offer and walks out the door.
That’s where vibe coding’s real cost lives. Not in the tool. In the handoff.
When a developer who built something with an AI agent leaves, what exactly is the codebase they’re leaving behind? It’s a working application that nobody on the remaining team can read with any confidence. Not because the code is obfuscated — it’s usually pretty clean, syntactically — but because nobody built the mental model along the way. Nobody walked through the tradeoffs. Nobody watched a diff and thought hard about why the abstraction was built that way. The code shipped, the feature landed, and the reasoning evaporated. That’s not a new problem, but vibe coding accelerates it by an order of magnitude. You can now accumulate understanding debt faster than any previous generation of tooling allowed.
The onboarding question makes this concrete. How do you bring a non-engineer into ownership of code they didn’t write, that was itself generated by a system nobody fully supervised? You can’t. Not really. You can hand them the application, the prompts that built it, and a prayer. But the judgment that lives inside a codebase — why this library and not that one, why this schema, why this boundary between services — that judgment doesn’t transfer through a Notion doc. It was never written down because the person who made those calls didn’t make them consciously. They accepted what the model suggested, it worked, and they moved on. This is fine when the system is small and the author is still in the room. It becomes a serious liability at the scale most organizations actually operate at.
The skill decay piece is the one I find most underreported, because it’s slow and quiet and doesn’t show up in any sprint metric. Engineers who stop reading diffs — genuinely reading them, not just approving them — lose the ability to reason about systems over time. The muscle atrophies. It doesn’t happen in a month. It happens over a year of “that looks fine, ship it,” and by the time you notice, the engineer who used to be your most reliable reviewer now struggles to hold a whole service’s behavior in their head during an incident. I don’t think this is hypothetical. I think it’s already happening on teams that adopted AI-assisted development early and enthusiastically without thinking about what they were trading away.
There’s a real chasm opening up between early adopters and general organizations, and I want to name it precisely. Startups are celebrating outputs — and they should be, because the outputs are impressive and speed is survival. But most companies haven’t built the organizational capability to own what they’re generating. They can produce it. They can’t maintain it, extend it, or reason about it when something goes wrong at 2am. The early adopters set the pace. The institutions are trying to match the pace without building the foundation that makes the pace sustainable.
My actual position is this: use the tools aggressively. I do. I’m not arguing for restraint at the generation layer. I’m arguing for intentionality at the judgment layer — and that judgment does not come for free, it does not emerge automatically from using the tools long enough, and it will not be unlocked in a future model release. The tools give you speed. You have to supply the wisdom. That means making your reasoning visible even when the model does the typing. It means reviewing what you ship with the same rigor you’d apply to a junior engineer’s PR. It means building the mental model actively, not passively accepting whatever compiles.
Here’s the challenge I’ll leave you with: go look at something you shipped with AI assistance in the last six months. Not to run it — to read it. Can you explain every meaningful decision in that codebase? If you can’t, you don’t own it. You’re just hosting it. That’s a different thing, and you should know which one you’re doing.
Key Insight
The most honest thing you can say about vibe coding in 2026 is this: the tools have matured from novelty to infrastructure, but the missing ingredient was never a better model. It was always the thing that makes a diff worth approving — engineering judgment. That’s not being automated away. It’s being revealed as the job.
You can’t vibe your way past that. You can only get better at it.
Researched and drafted with Claude, Gemini, and OpenAI models. Editorial judgment by Mrityunjay Kumar (MJ).


