Vibe Coding vs Real Engineering: When AI-Generated Code Breaks

How much can you trust 'vibe coded' AI output for production systems? Where AI shines, where it silently fails, and the engineering guardrails that actually work.

Published May 8, 2026 · 7 min read · By Softronic · Tags: ai, engineering, code-quality, vibe-coding

A founder told us last month: “We don’t need engineers anymore. Cursor writes 90% of our code.” Two weeks later he called back because his auth flow was leaking session tokens to the URL fragment, his Postgres connection pool was exhausted under any real load, and his Stripe webhooks were processing payments twice on retries. All three bugs were in code that “looked fine” and passed the tests the AI also wrote.

This isn’t an AI-bashing post. We use AI every day. It’s a post about the difference between vibe coding and engineering with AI as a tool, and why the second one is the only one that survives production.

Defining vibe coding

The term was coined by Andrej Karpathy in early 2025 to describe a workflow where you prompt an LLM, accept what comes back, and ship without reading it carefully. The vibes are good. The code feels right. Tests pass. Ship it.

In 2026 the practice has spread far beyond Karpathy’s original tongue-in-cheek framing. We’ve seen Series A startups with 30% of their backend committed by a junior engineer pressing tab in Cursor with no senior review on the PR. We’ve seen founders building MVPs entirely in Bolt and Lovable and then handing the codebase to a contractor with “make this scale to 10K users.” The contractor quotes a full rewrite. The founder is shocked.

Vibe coding produces code that works for the happy path the model was thinking about. That’s it. Everything else is unverified.

Where vibe coding actually works

We’re not anti-AI. Here’s where the workflow is genuinely fine:

Throwaway scripts. One-off data cleaning, log parsing, a quick migration tool you’ll run once. The cost of a bug is “run it again.”
Prototypes that won’t ship. Internal demos, design tests, prove-the-concept apps you’re going to throw out next week.
UI scaffolding. Generating a settings page, a form, a list view. The visual feedback is immediate; broken UI is obvious.
Boilerplate. Routes, basic CRUD, type definitions, simple test fixtures. The patterns are well-trodden enough that the model rarely invents.
Pull-request descriptions, docs, changelogs. Anywhere the output is human-reviewable text, not machine-executable code.

In all five cases, the cost of failure is low and the failure is loud. Vibe away.

Where vibe coding silently fails

The dangerous category is code that runs, passes tests, and is wrong. Five places we see this consistently:

1. Race conditions and concurrency

LLMs are statistical models trained on code as text. Concurrency bugs don’t usually appear in the text — they appear in the runtime behavior. A model will happily write a “lock-free counter” that has a subtle data race because it’s seen patterns that look similar in its training data.

Real example from a client codebase last quarter: a vibe-coded background job processor that polled a Postgres queue without SELECT ... FOR UPDATE SKIP LOCKED. Two workers picked the same job. The Stripe charge ran twice. The customer disputed. The “tests” had only one worker.
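Here is a minimal sketch of that failure, not the client's code. The queue is an in-memory array standing in for a Postgres table, and every name (`Job`, `selectPending`, `claim`, the `jobs` table in the SQL comment) is an assumption made for illustration. The interleaving is written out by hand so the race is deterministic.

```typescript
type Job = { id: number; status: "pending" | "running" };

// A plain SELECT: returns the first pending job without locking it.
function selectPending(queue: Job[]): Job | undefined {
  return queue.find((job) => job.status === "pending");
}

// Naive workers: both read before either writes, so both get the same job.
function naiveTwoWorkers(queue: Job[]): number[] {
  const a = selectPending(queue); // worker A reads
  const b = selectPending(queue); // worker B reads before A has written
  if (a) a.status = "running";
  if (b) b.status = "running";
  return [a, b].map((job) => (job ? job.id : -1));
}

// Atomic claim: find and mark in one indivisible step.
// In Postgres the rough equivalent is a single statement such as:
//   UPDATE jobs SET status = 'running'
//   WHERE id = (SELECT id FROM jobs WHERE status = 'pending'
//               ORDER BY id LIMIT 1 FOR UPDATE SKIP LOCKED)
//   RETURNING id;
function claim(queue: Job[]): Job | undefined {
  const job = selectPending(queue);
  if (job) job.status = "running";
  return job;
}

function atomicTwoWorkers(queue: Job[]): number[] {
  return [claim(queue), claim(queue)].map((job) => (job ? job.id : -1));
}

function freshQueue(): Job[] {
  return [
    { id: 1, status: "pending" },
    { id: 2, status: "pending" },
  ];
}

console.log(naiveTwoWorkers(freshQueue())); // [ 1, 1 ]
console.log(atomicTwoWorkers(freshQueue())); // [ 1, 2 ]
```

A test with a single worker never produces the first interleaving, which is why the original suite stayed green.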

2. Security boundaries

The model knows enough to add bcrypt and call it auth. It doesn’t know enough to think about:

Session fixation vs session regeneration on login
The CSRF implications of the specific framework you’re using
Whether your JWT secret is being read at module load time and cached in a way that breaks rotation
The 14 ways your file-upload endpoint can be turned into an SSRF

We’ve audited 22 vibe-coded codebases in the last 9 months. Every single one had at least one OWASP Top 10 vulnerability in shipped, production code. The median was 4.
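Session fixation, the first item in the list above, is small enough to sketch. The `SessionStore` class and its method names are hypothetical, and the counter-based ID generator exists only to make the demo predictable. A real store needs a cryptographically secure random source.

```typescript
type Session = { userId: string | null };

class SessionStore {
  private sessions = new Map<string, Session>();
  private newId: () => string;

  // The ID generator is injected to keep the sketch dependency-free.
  constructor(newId: () => string) {
    this.newId = newId;
  }

  // Anonymous session issued before login.
  createAnonymous(): string {
    const id = this.newId();
    this.sessions.set(id, { userId: null });
    return id;
  }

  // Vulnerable: the pre-login ID survives login, so anyone who planted
  // that ID in the victim's browser now holds an authenticated session.
  loginKeepingId(id: string, userId: string): string {
    this.sessions.set(id, { userId });
    return id;
  }

  // Safer: invalidate the old ID and issue a fresh one on login.
  loginWithRegeneration(oldId: string, userId: string): string {
    this.sessions.delete(oldId);
    const id = this.newId();
    this.sessions.set(id, { userId });
    return id;
  }

  userFor(id: string): string | null {
    const session = this.sessions.get(id);
    return session ? session.userId : null;
  }
}

function fixationDemo(): { kept: string | null; regenerated: string | null } {
  let n = 0;
  const store = new SessionStore(() => `sid-${++n}`); // demo-only generator

  const plantedA = store.createAnonymous(); // attacker knows this ID
  store.loginKeepingId(plantedA, "alice");

  const plantedB = store.createAnonymous(); // attacker knows this one too
  store.loginWithRegeneration(plantedB, "bob");

  // What the attacker sees when replaying the planted IDs.
  return { kept: store.userFor(plantedA), regenerated: store.userFor(plantedB) };
}

console.log(fixationDemo()); // { kept: 'alice', regenerated: null }
```

Both login methods pass a "user can log in" test. Only a test written from the attacker's side tells them apart.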

3. Edge cases the model didn’t think to test

A model writes a function. The model writes tests. The tests cover what the model was thinking about. There is no adversary in the loop. The user who enters '; DROP TABLE users;-- as their first name, the date arithmetic that produces month 13, the network call that times out partway through, the timezone whose UTC offset is not a whole number of hours: none of these were in the model’s context, so none are tested.

You don’t catch what you don’t think to look for. Senior engineers have scar tissue from previous outages that makes them paranoid in the right ways. Models don’t.

4. Distributed systems and consistency

The single hardest category. The model will confidently write code that “saves to the database and then sends a notification.” That’s a dual-write problem. If the notification service is down, the database is updated and the notification is lost. Flip the order to fix that, and a database write that fails after the notification has gone out means the user gets told something happened that didn’t.

Real engineering involves outbox patterns, idempotency keys, sagas, compensating transactions. The model will not reach for these unless you specifically prompt for them, and even then it often produces a half-correct version that breaks in subtle ways under partial failure.
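To make the outbox pattern concrete, here is a minimal in-memory sketch. `Db`, `placeOrder`, and `relay` are hypothetical names, and the `transaction` method only imitates all-or-nothing semantics by snapshotting arrays. The idea is that the business row and the pending notification are written together, and a separate relay delivers the notification later.

```typescript
type Order = { id: string; total: number };
type OutboxRow = { id: number; topic: string; payload: string; sent: boolean };

class Db {
  orders: Order[] = [];
  outbox: OutboxRow[] = [];

  // Stand-in for a real transaction: apply every write or none of them.
  transaction(work: (tx: Db) => void): void {
    const ordersBefore = [...this.orders];
    const outboxBefore = this.outbox.map((row) => ({ ...row }));
    try {
      work(this);
    } catch (err) {
      this.orders = ordersBefore;
      this.outbox = outboxBefore;
      throw err;
    }
  }
}

// The order and its pending notification commit together.
function placeOrder(db: Db, order: Order): void {
  db.transaction((tx) => {
    tx.orders.push(order);
    tx.outbox.push({
      id: tx.outbox.length + 1,
      topic: "order.placed",
      payload: JSON.stringify(order),
      sent: false,
    });
  });
}

// Runs separately from the request path. Delivery is at-least-once,
// so the consumer still has to deduplicate.
function relay(db: Db, send: (row: OutboxRow) => void): number {
  let delivered = 0;
  for (const row of db.outbox) {
    if (row.sent) continue;
    try {
      send(row);
      row.sent = true;
      delivered++;
    } catch {
      // Leave the row unsent; the next relay pass retries it.
    }
  }
  return delivered;
}

const db = new Db();
placeOrder(db, { id: "ord_1", total: 4200 });
const whileDown = relay(db, () => {
  throw new Error("notification service down");
});
const afterRecovery = relay(db, () => {});
console.log(whileDown, afterRecovery); // 0 1
```

The outage no longer loses the notification. It only delays it until the relay's next pass.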

5. Performance under realistic load

LLMs love Promise.all with 10,000 items. They love N+1 queries dressed as elegant .map() chains. They love loading the entire users table into memory because “it’s clean.”

These problems don’t appear in any test. They appear when your customer with 200K rows in their account opens the dashboard and waits 8 seconds.
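For the `Promise.all` case, one common fix is to cap how many calls are in flight at once. The helper below is a sketch under that assumption. `mapWithLimit` is a name made up for this post, not a library function, and `await Promise.resolve()` stands in for the real database or HTTP call.

```typescript
// Runs fn over items with at most `limit` calls in flight at once.
async function mapWithLimit<T, R>(
  items: readonly T[],
  limit: number,
  fn: (item: T, index: number) => Promise<R>,
): Promise<R[]> {
  const results: R[] = new Array<R>(items.length);
  let next = 0;

  async function worker(): Promise<void> {
    while (next < items.length) {
      const i = next++; // no await between the check and the increment
      results[i] = await fn(items[i], i);
    }
  }

  const workerCount = Math.max(1, Math.min(limit, items.length));
  const workers: Promise<void>[] = [];
  for (let w = 0; w < workerCount; w++) workers.push(worker());
  await Promise.all(workers);
  return results;
}

// Demo: 100 fake calls, never more than 5 running together.
async function boundedDemo(): Promise<{ peak: number; results: number[] }> {
  let inFlight = 0;
  let peak = 0;
  const ids = Array.from({ length: 100 }, (_, i) => i);
  const results = await mapWithLimit(ids, 5, async (id) => {
    inFlight++;
    peak = Math.max(peak, inFlight);
    await Promise.resolve(); // stands in for a DB or HTTP call
    inFlight--;
    return id * 2;
  });
  return { peak, results };
}

boundedDemo().then((r) => console.log(r.peak, r.results.length)); // 5 100
```

Results come back in input order, so the helper can replace `Promise.all(items.map(fn))` without changing the caller. It does nothing for N+1 queries, which need a batched query instead.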

The cost of cleaning vibe-coded production code

We see this a lot. A founder ships fast with vibe coding, gets traction, then realizes the codebase can’t take a second engineer because nobody can reason about it. They call us.

Average cost of a vibe-cleanup engagement for a 30K-LOC codebase: $30K-$70K and 6-10 weeks. Average cost of having written that same codebase with engineering discipline from week one: $25K-$50K and 8-12 weeks.

The vibe path looks cheaper at the start. It is almost always more expensive over 18 months once you count the cleanup, the customer-facing bugs in the meantime, and the cofounder time spent debugging instead of selling.

The 5 guardrails that actually work

If you’re going to use AI for production code (you should), here are the five guardrails we use across every engagement.

Guardrail 1 — Code review is non-negotiable

Every AI-generated PR gets reviewed by a human engineer who can read it line by line and reason about it. Not “scan for vibes.” Read.

This means the AI is a productivity multiplier on people who already know how to write good code. It is not a replacement for someone who knows what good code is.

If your reviewer is also vibe coding, you’ve collapsed the guardrail. Two LLMs nodding at each other.

Guardrail 2 — Test coverage requirements, but the right ones

Don’t measure raw coverage percentage. Ask instead whether you have a test for every external boundary (API, DB, third party), every auth path, every state machine transition, and every retry/error branch.

We use a “boundary test budget” instead of coverage targets. Coverage gets gamed by tests that exist to make the number go up. Boundary tests check the things that actually break.

Guardrail 3 — Eval pipelines, not just unit tests

For any AI-assisted feature that involves business logic, we maintain an eval suite: a set of representative real-world inputs (including the adversarial ones) that runs on every PR. This catches the “model regenerated this function and it now fails on case X” class of bug.

Evals are cheap. We’ve never regretted having them. We’ve regretted not having them many times.
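An eval suite can be as small as a table of named cases and a runner. The sketch below is illustrative: `runEvals`, `parseMonth`, and the cases are hypothetical, and the runner compares with `===`, so it only suits primitive outputs.

```typescript
type EvalCase<I, O> = { name: string; input: I; expected: O };

// Returns the names of the cases that fail. A throw counts as a failure.
function runEvals<I, O>(fn: (input: I) => O, cases: EvalCase<I, O>[]): string[] {
  const failures: string[] = [];
  for (const c of cases) {
    let passed = false;
    try {
      passed = fn(c.input) === c.expected;
    } catch {
      passed = false;
    }
    if (!passed) failures.push(c.name);
  }
  return failures;
}

// Hypothetical function under evaluation.
function parseMonth(raw: string): number | null {
  const text = raw.trim();
  if (!/^\d{1,2}$/.test(text)) return null;
  const month = Number(text);
  return month >= 1 && month <= 12 ? month : null;
}

// A plausible "regenerated" version that still passes the happy path.
function regressedParseMonth(raw: string): number | null {
  return Number(raw);
}

const monthCases: EvalCase<string, number | null>[] = [
  { name: "plain", input: "7", expected: 7 },
  { name: "padded", input: " 07 ", expected: 7 },
  { name: "month 13", input: "13", expected: null },
  { name: "zero", input: "0", expected: null },
  { name: "sql-looking text", input: "'; DROP TABLE users;--", expected: null },
  { name: "negative", input: "-1", expected: null },
];

console.log(runEvals(parseMonth, monthCases)); // []
console.log(runEvals(regressedParseMonth, monthCases));
// [ 'month 13', 'zero', 'sql-looking text', 'negative' ]
```

The regressed version passes both happy-path cases and fails every adversarial one, which is the class of bug the suite exists to catch.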

Guardrail 4 — Scoped autonomy

AI agents and Claude/Cursor sessions get scoped permissions, not full repo access. Specifically:

Cannot push to main directly
Cannot run database migrations against production
Cannot deploy
Cannot read or write secrets
Can read source, suggest changes, open PRs

The blast radius of a vibe is bounded by what the AI can actually do.
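The list above amounts to a deny-by-default allowlist. The sketch below shows only that shape. The action names are invented for illustration, and in practice enforcement belongs in the platform (branch protection, credentials the agent never receives), not in application code.

```typescript
// Hypothetical action names, one per line of the policy above.
type Action =
  | "read_source"
  | "suggest_change"
  | "open_pr"
  | "push_main"
  | "run_prod_migration"
  | "deploy"
  | "read_secret"
  | "write_secret";

// Anything not listed here is denied.
const AGENT_ALLOWED: ReadonlySet<Action> = new Set<Action>([
  "read_source",
  "suggest_change",
  "open_pr",
]);

function isAllowed(action: Action): boolean {
  return AGENT_ALLOWED.has(action);
}

// Throws instead of returning false so a forgotten check fails loudly.
function authorize(action: Action): void {
  if (!isAllowed(action)) {
    throw new Error(`agent is not permitted to ${action}`);
  }
}

console.log(isAllowed("open_pr")); // true
console.log(isAllowed("deploy")); // false
```

Listing what is allowed, not what is forbidden, means a new capability stays blocked until someone deliberately grants it.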

Guardrail 5 — Paired engineer-AI workflow

Our standard model: a senior engineer drives, AI assists. The engineer holds the mental model of the system, the AI accelerates the typing and lookup work. Not the other way around.

The anti-pattern: a junior or non-engineer prompts the AI, and the AI holds the mental model. Nobody on the team can debug the result because nobody on the team understands it.

Real bugs from real codebases (anonymized)

A small collection from the last 90 days:

Stripe webhooks processed twice. Missing idempotency key check. AI wrote the handler, didn’t include the idempotency table.
Auth token leaked in URL fragment. AI generated an OAuth callback that put the access token in the URL hash. Cookies would have been the right move.
Postgres connection pool exhausted at 50 concurrent users. AI used a new pool per request instead of a shared one. Worked fine in test.
Cron job ran 24 times instead of once. AI configured the cron at the application level and in Kubernetes, both running.
GDPR right-to-deletion request silently failed. AI deleted from the users table but not from the 6 join tables. No error. Looked successful.

None of these were caught by tests. All of them were caught by senior code review.
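The first bug in that list has a compact fix. This is a sketch, assuming each webhook event carries a unique ID. `PaymentLedger` and its fields are hypothetical, and the `Set` stands in for an idempotency table with a unique constraint on the event ID.

```typescript
type WebhookEvent = { id: string; type: string; amountCents: number };

class PaymentLedger {
  // Stand-in for an idempotency table keyed by event ID.
  private processed = new Set<string>();
  totalCents = 0;

  // In a real database, recording the event ID and applying the payment
  // should happen in the same transaction.
  handle(event: WebhookEvent): "processed" | "duplicate" {
    if (this.processed.has(event.id)) return "duplicate";
    this.processed.add(event.id);
    this.totalCents += event.amountCents;
    return "processed";
  }
}

const ledger = new PaymentLedger();
const paid: WebhookEvent = { id: "evt_1", type: "payment.succeeded", amountCents: 4200 };

console.log(ledger.handle(paid)); // processed
console.log(ledger.handle(paid)); // duplicate (the retry)
console.log(ledger.totalCents); // 4200
```

A test that sends each event once cannot fail here. The case that matters is the same event delivered twice.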

Where Softronic sits

We use AI every day. Every senior engineer on our team has Claude Code or Cursor open during work. We commit faster than we used to.

We also review every line that ships to production. We pair-program. We write evals. We keep humans in the architectural loop. The output is faster than 2023 engineering and more reliable than vibe coding.

If you’ve shipped a product on vibe coding and you want a real engineering team to harden it, that’s a thing we do. If you’re starting from scratch and want it built right the first time, that’s also a thing we do.

Ready to ship code that doesn’t break in production?

Custom builds from $15K, 6-14 weeks, fixed price after week one. Senior engineers, AI as a tool, not a replacement.

Read more about how we build on our custom software page or in our full services menu.