AI Agents in Production: Five Hard-Won Lessons From Shipping in 2026

Demos lie. Production teaches. Five lessons from shipping AI agents to real users — eval pipelines, cost ceilings, fallback strategy, observability, and more.

Published · April 22, 2026 6 min read By Softronic aiagentsproductionllm-opsrag

Every AI agent demo we’ve seen this year works. Every AI agent in production has, at some point, hallucinated a refund, called a tool with garbage arguments, or burned $40 of token spend in a single retry loop. Demos are scripted on the happy path. Production is the unhappy path 30% of the time.

We’ve shipped agents for support triage, internal search, sales-call summarization, and structured data extraction over the last 12 months. Most of them work well now. None of them worked well on day one. Here are the five lessons that cost us the most to learn.

Lesson 1 — Eval pipelines are not optional. They are the product.

The single biggest mistake teams make: they write the agent, ship it to staging, eyeball ten outputs, and call it good. Three weeks later a user complains and nobody can tell whether the fix made things better or worse.

You need an eval pipeline before you write the second prompt. Not after launch. Not after the first regression. Before.

What we run for every agent we ship:

A frozen test set of 50-300 real inputs. Drawn from production logs once you have them. From founders, support tickets, or interview transcripts before that.
A scoring rubric per task. For factual extraction, exact match or JSON-schema validity. For summarization, an LLM-as-judge with a calibrated rubric plus 20% human sampling. For tool-using agents, success on the end-to-end outcome, not just intermediate tool calls.
Versioned prompts and model IDs. Every eval run logs the prompt hash, the model version, the temperature, and the tool definitions. If you change the prompt without a re-run, you are guessing.
A regression gate in CI. Prompt changes that drop the eval score by more than 3% block the merge.

Teams that resist this say it’s overkill for an early-stage product. We disagree. The eval set is the spec. If you can’t write down 50 examples of what “good” looks like, you don’t understand the problem yet, and you should not be shipping an agent for it.

Lesson 2 — Cost ceilings will save you from a 3 AM Slack message

The week we shipped our first customer-facing agent, a user discovered they could ask it to “summarize this entire 400-page PDF and then translate the summary into 12 languages.” The agent obliged. We discovered this when our billing alert went off at 2:47 AM.

Set hard ceilings, at three levels:

Per-call ceiling. Token budget per single agent invocation. We default to 30K input + 8K output tokens with explicit truncation logic. If a user input would exceed, we either summarize-first or refuse-with-explanation.
Per-user-per-day ceiling. Tracked in Redis or your DB. We default to $2-$5/day for free-tier users, $20-$50 for paid. Hitting the ceiling returns a graceful rate-limit response, not a 500.
Per-tenant-per-month ceiling. With Slack alerts at 50%, 80%, and 100% of the configured budget. Hard cutoff at 120%.

We average about $0.04-$0.18 per agent call across our deployed products, with the long tail going to about $1.20 on complex multi-hop tasks. The ceilings keep the long tail from becoming a four-figure surprise.

A subtler cost lesson: caching is a feature, not an optimization. Anthropic’s prompt caching cuts the cost of repeated system prompts by roughly 90%. If you’re not using it for any prompt over 4K tokens that gets called more than 10 times an hour, you’re lighting money on fire.

Lesson 3 — The fallback path matters more than the happy path

A useful question to ask your team: what does the agent do when the LLM call fails? When the tool call returns malformed JSON? When the user asks something the agent wasn’t designed for?

For most teams the answer is “I don’t know, probably 500s.” That answer is the bug.

Fallback design we now use as a default:

Retry once with a stricter system prompt if a tool call returns malformed output. Don’t loop. One retry, then escalate.
Schema-validate every structured output. If validation fails after the retry, return a typed error, not a hallucinated guess. We use Zod or Pydantic depending on stack.
Route to a deterministic path when confidence is low. If the agent’s classifier score for “this is a refund question” is below 0.7, route to a human or a rules-based handler. Don’t let the agent improvise.
User-visible degradation, not silent failure. “I’m not sure I can answer this — would you like me to escalate to a human?” is a better UX than a 6-second pause followed by something wrong.

In our most-trafficked agent, the happy path handles 71% of queries, the fallback path handles 24%, and the human escalation handles 5%. The fallback path is where the user trust gets built or burned. We spend roughly equal engineering time on the three.

Lesson 4 — Observability needs latency, tool traces, and cost in one view

Standard APM tools — Datadog, New Relic, Sentry — were built for request/response services. They are not enough for agents, because an agent is a graph of LLM calls and tool calls, with branching, retries, and recursion.

We use a dedicated LLM-ops layer (Langfuse, Helicone, or Arize Phoenix depending on client) and capture:

End-to-end trace per agent invocation. Every LLM call, every tool call, every retry, with parent/child relationships.
Latency at each step. Median, P95, P99. Agents that “feel slow” are almost always slow at one specific step you wouldn’t have guessed.
Cost per trace. Tied back to user ID and tenant ID. Sortable. The 80/20 of cost lives in 5% of the traces.
Output diff on prompt changes. When you change a system prompt, you need to see what the agent did differently on the eval set, side by side.

A real example: we had a customer-support agent averaging 8.2 seconds per response. Standard logging said “LLM call slow.” The trace view said “the first LLM call returns in 1.1 seconds, then we call a CRM tool that takes 5.8 seconds, then the second LLM call adds 1.3.” Fix: parallelize the CRM call with a preliminary LLM step. New average: 3.4 seconds.

You will not find that fix in console.log. Invest in the right tooling on day one.

Lesson 5 — The human-in-the-loop never fully goes away

The pitch “fully autonomous agent” is, in 2026, still mostly marketing. The agents that earn user trust have humans in the loop at strategic points:

Before any irreversible action. Sending an email, charging a card, scheduling a meeting, modifying production data. Show the proposed action, require confirmation. The friction is the feature.
As a periodic auditor. A weekly sample of 30-50 random agent outputs reviewed by a human, with a rubric, with the results fed back into the eval set.
As an escalation path. When the agent’s confidence is low, when the user explicitly asks, or when policy requires it (financial, medical, legal).

The agents that pretend they don’t need humans are the ones that end up in screenshots on Twitter for the wrong reasons.

A short note on what we’d change if we started over

If we were building agent number one again, knowing what we know now, the order would be:

Eval set first. Before any model selection. Before any tool design.
Cost ceilings configured before the first call to production.
Observability layer wired in before the first internal demo.
Happy path implemented.
Fallback path implemented at the same level of rigor as the happy path.
Human-in-the-loop checkpoints designed for the riskiest 20% of actions.

This is the inverse of how most teams build. Most teams write the happy path first, ship it, and bolt on the rest after the first incident. The cost of that order — in user trust, in surprise invoices, in late nights — is the lesson you want to learn from someone else’s scars rather than your own.

How we work with clients on production AI

Softronic builds production AI for B2B clients across the US and LatAm. We come in either as a delivery team for a specific agent or as an embedded “AI tiger team” for an existing product. We bring the eval-first methodology, the cost-control patterns, and the observability stack as defaults. You bring the domain knowledge and the data.

If your team is shipping AI features and hitting the demo-to-production wall, we should talk.

Production AI is not a model selection problem. It’s a systems engineering problem with a probabilistic component in the middle. Build it like that and the demos start matching reality.