From RAG to Voice: Building Production AI That Actually Ships
Most AI demos break in production. Here's how to build RAG over your data, vertical agents and voice interfaces that survive real users — and stay within budget.
A founder showed us a demo last month. Beautiful chatbot answering questions over 50,000 PDFs. Launched to 200 enterprise users on Monday. By Wednesday it was making up case numbers, hallucinating contract terms, and burning $4,200/day in tokens. They pulled it Friday.
This is the AI production gap, and it’s where most 2026 projects die. The demo works on a clean question against a clean dataset. Production gives you 50,000 messy PDFs, a long tail of question shapes nobody anticipated, users who paste 30K-token transcripts into the search box, and a finance team asking why the OpenAI bill tripled. Closing that gap is the actual job.
Here’s what we’ve learned shipping production AI for 14 clients over the last 18 months: what works, what doesn’t, what to spend money on, and where to refuse to cut corners.
RAG that survives real users
Retrieval-augmented generation is the backbone of useful enterprise AI in 2026. Pure prompting against a frontier model is great for general chat. Pure fine-tuning is for narrow, stable tasks. Everything else — search over your data, support assistants, internal copilots — is RAG.
The naive RAG pipeline (embed everything, cosine similarity, stuff top-k into the prompt) is what you see in tutorials. It’s also what fails at scale. Here’s the production-grade version.
Chunking is where most pipelines break
Default chunking strategies split documents into 500-token windows with 50-token overlap. This destroys context, especially for legal, medical, and technical docs where the meaning lives across pages.
What actually works in 2026:
- Semantic chunking by section. Use the document’s own structure — headers, sections, paragraphs. For unstructured PDFs, use a layout-aware parser (Unstructured.io, LlamaParse, or Reducto).
- Parent-child chunks. Index small chunks for retrieval precision, return the parent chunk (the full section) for context. LangChain calls this ParentDocumentRetriever. We call it the thing that actually works.
- Metadata-aware chunks. Attach the source filename, page number, section title, last-updated date, and access permissions to every chunk. You will need all of these later.
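To make the parent-child and metadata ideas concrete, here is a minimal sketch in plain Python. It is not LangChain's implementation: the `Chunk` class, the word-window splitter, and the `child_size` parameter are assumptions chosen for illustration, and a real pipeline would split on tokens and store the children in a vector index.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Chunk:
    chunk_id: str
    text: str
    parent_id: Optional[str] = None  # None marks a parent (section-level) chunk
    metadata: dict = field(default_factory=dict)

def build_parent_child(section_id, section_text, metadata, child_size=40):
    """Split one section into small word-window children that point at the parent."""
    parent = Chunk(section_id, section_text, None, dict(metadata))
    words = section_text.split()
    children = []
    for start in range(0, len(words), child_size):
        children.append(Chunk(
            chunk_id=f"{section_id}#{start // child_size}",
            text=" ".join(words[start:start + child_size]),
            parent_id=section_id,
            metadata=dict(metadata),  # every child carries the section's metadata
        ))
    return parent, children

def expand_to_parents(hits, parents_by_id):
    """Swap retrieved children for their parents, deduplicated, in rank order."""
    seen, expanded = set(), []
    for hit in hits:
        parent_id = hit.parent_id or hit.chunk_id
        if parent_id not in seen:
            seen.add(parent_id)
            expanded.append(parents_by_id[parent_id])
    return expanded
```

The children are what you embed and search; `expand_to_parents` is what you run on the hits before building the prompt.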
Hybrid search beats vector search
Pure vector search is bad at exact-match queries. Try asking your vector-only system about “section 4.2.1” and watch it return semantically similar but wrong chunks.
We run hybrid search by default: BM25 (lexical) plus vector similarity, fused with reciprocal rank fusion. In Postgres with pgvector, this is 30 extra lines of SQL. In Pinecone or Weaviate it’s a feature flag. Skip it and you’ll burn weeks debugging “why did it return the wrong doc.”
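For readers who have not seen reciprocal rank fusion, here is a minimal sketch of the fusion step on its own, independent of any database. It assumes you already have two ranked lists of document IDs, one from BM25 and one from vector search. The function name is ours, and `k=60` is a commonly used default, not a tuned value.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc IDs into a single ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. Ties are broken by doc ID for determinism.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))

bm25_hits = ["a", "b", "c"]    # lexical ranking (illustrative IDs)
vector_hits = ["b", "d", "a"]  # vector ranking (illustrative IDs)
fused = reciprocal_rank_fusion([bm25_hits, vector_hits])  # ["b", "a", "d", "c"]
```

Documents that both retrievers agree on rise to the top, and neither score scale has to be normalized because only ranks are used.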
Reranking is non-optional
Retrieval gives you 50 candidates. The LLM context can hold maybe 10. Picking which 10 with a reranker (Cohere Rerank, Voyage, or a small fine-tuned BERT) lifts answer quality 15-30% on every eval we’ve run. The cost is $0.0005 per query. Add it.
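The shape of the rerank step is simple enough to sketch. The example below assumes a pluggable `score_fn(query, passage)`; in production that would be a call to a hosted reranker or a cross-encoder. The `overlap_score` function here is only a toy stand-in so the sketch runs without any external service.

```python
def rerank(query, candidates, score_fn, top_n=10):
    """Score every candidate against the query and keep the best top_n.

    The original retrieval position is used as a tie-breaker, so equal
    scores keep their retrieval order.
    """
    scored = [(score_fn(query, passage), position, passage)
              for position, passage in enumerate(candidates)]
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [passage for _, _, passage in scored[:top_n]]

def overlap_score(query, passage):
    """Toy stand-in scorer: fraction of query words present in the passage."""
    query_words = set(query.lower().split())
    passage_words = set(passage.lower().split())
    return len(query_words & passage_words) / len(query_words) if query_words else 0.0
```

Usage: retrieve 50, call `rerank(query, candidates, score_fn, top_n=10)`, and only the survivors go into the prompt.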
When to choose what storage
- pgvector on Postgres. Default choice for under 5M chunks. You already have Postgres. One less system to operate.
- Pinecone. When you cross 10M chunks or need multi-tenant isolation with per-namespace metadata. Costs scale with vectors stored.
- Turbopuffer. Cheap at scale, fast cold starts. We’re seeing more teams move here in 2026.
- Weaviate or Qdrant self-hosted. When data-residency or cost-at-scale demands it. You take on the ops.
Vertical agents that handle real workflows
Generic “AI agents” that browse the web and book flights are mostly a demo. Useful agents in 2026 are vertical: they handle one workflow end-to-end inside one domain, with tools constrained to that domain.
The pattern we ship most often: a customer support agent that reads a ticket, pulls the customer’s account state via API, checks against documentation via RAG, drafts a reply, and either sends it (low-risk tickets) or routes to a human with a recommended draft (high-risk tickets).
Building that reliably requires four things most teams skip.
Tools, not free-form actions
We define every action the agent can take as a typed function: get_customer_account(customer_id), check_refund_eligibility(order_id), escalate_to_human(reason, priority). The LLM picks tools and arguments; deterministic code executes them. This is the difference between an agent that works and an agent that “tries to.”
LangGraph and OpenAI’s Agents SDK both formalize this pattern. Anthropic’s tool-use API has been stable for two years. There is no excuse for free-form actions in 2026.
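A framework-free sketch of the same idea, assuming the model emits a JSON object with `name` and `arguments` keys (that wire format, the schema layout, and the stubbed function bodies are assumptions for illustration, not any vendor's API):

```python
import json

def get_customer_account(customer_id: str) -> dict:
    # Stub: a real implementation would call your account API.
    return {"customer_id": customer_id, "plan": "pro"}

def escalate_to_human(reason: str, priority: str) -> dict:
    # Stub: a real implementation would open a ticket in your queue.
    return {"escalated": True, "reason": reason, "priority": priority}

# Registry: tool name -> (function, expected argument names and types).
TOOLS = {
    "get_customer_account": (get_customer_account, {"customer_id": str}),
    "escalate_to_human": (escalate_to_human, {"reason": str, "priority": str}),
}

def execute_tool_call(raw_call: str) -> dict:
    """Validate a model-proposed tool call, then run it with deterministic code."""
    call = json.loads(raw_call)
    name = call.get("name")
    args = call.get("arguments") or {}
    if name not in TOOLS:
        return {"error": f"unknown tool: {name}"}
    if not isinstance(args, dict):
        return {"error": "arguments must be an object"}
    fn, schema = TOOLS[name]
    if set(args) != set(schema):
        return {"error": f"expected arguments {sorted(schema)}, got {sorted(args)}"}
    for key, expected_type in schema.items():
        if not isinstance(args[key], expected_type):
            return {"error": f"argument {key!r} must be {expected_type.__name__}"}
    return {"result": fn(**args)}
```

Errors are returned as data, not raised, so they can be fed back to the model as the tool result and it can correct its own call.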
State machines beat “let the model figure it out”
For workflows with more than three steps, we use an explicit state machine. The LLM decides transitions; the framework enforces validity. LangGraph, Inngest’s agent kit, or Temporal-with-LLM-steps all work here.
Letting GPT-5 or Claude 4.7 “figure out the workflow” works in demos. In production, with 500 concurrent users, you’ll see the agent loop on itself, skip required steps, or just stall. State machines fix this.
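Stripped of any framework, the enforcement part looks roughly like the sketch below. The state names are hypothetical, modeled on the support-agent workflow described earlier, and `max_steps` is an assumed safety limit. The model proposes the next state; the code decides whether that move is legal.

```python
# Legal transitions: current state -> states the model may move to next.
ALLOWED = {
    "read_ticket": {"fetch_account"},
    "fetch_account": {"search_docs"},
    "search_docs": {"draft_reply"},
    "draft_reply": {"send_reply", "route_to_human"},
    "send_reply": set(),      # terminal
    "route_to_human": set(),  # terminal
}

class Workflow:
    def __init__(self, start="read_ticket", max_steps=10):
        self.state = start
        self.history = [start]
        self.max_steps = max_steps

    @property
    def done(self):
        return not ALLOWED[self.state]

    def advance(self, proposed):
        """Apply a model-proposed transition, rejecting loops and skipped steps."""
        if len(self.history) > self.max_steps:
            raise RuntimeError("step limit reached, escalate to a human")
        if proposed not in ALLOWED[self.state]:
            raise ValueError(f"illegal transition {self.state} -> {proposed}")
        self.state = proposed
        self.history.append(proposed)
```

A rejected transition is a signal to re-prompt or escalate. The model never gets to skip `fetch_account` just because it felt confident.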
Guardrails at three layers
- Input. Filter prompt injection attempts before they hit the model. Lakera, NeMo Guardrails, or a small classifier you train yourself.
- Tool. Permission checks at the tool boundary. The model can ask to refund $50K; your code refuses unless the user is an admin.
- Output. PII scrubbing and policy checks before the response goes to the user.
Guardrails are not optional once you have real customers. The cost of one bad output to one important customer is higher than every guardrail you’ll ever pay for.
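The tool-layer check is the easiest of the three to show in code. This is a minimal sketch with hypothetical roles and limits (`REFUND_LIMITS`, `Caller`, and `issue_refund` are illustrative names, and the dollar amounts are made up). The point is that the limit lives in your code, where no prompt can talk its way past it.

```python
from dataclasses import dataclass

# Hypothetical per-role refund limits in USD; unknown roles get no refund rights.
REFUND_LIMITS = {"agent": 100.0, "admin": 50_000.0}

@dataclass(frozen=True)
class Caller:
    user_id: str
    role: str

class PermissionDenied(Exception):
    pass

def issue_refund(caller: Caller, order_id: str, amount: float) -> dict:
    """Refund tool: the model may request any amount, the code enforces the limit."""
    limit = REFUND_LIMITS.get(caller.role, 0.0)
    if amount <= 0 or amount > limit:
        raise PermissionDenied(
            f"{caller.role} may refund up to {limit:.2f}, requested {amount:.2f}")
    # A real implementation would call the payments API here.
    return {"order_id": order_id, "refunded": amount, "approved_by": caller.user_id}
```

The `Caller` comes from your authenticated session, never from anything the model or the end user typed.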
Evals from day one
If you don’t have an eval set, you don’t have an AI product. You have a vibe.
Our minimum: 50 hand-curated test cases that represent real user inputs, with expected outputs (or expected behavior signals). Re-run on every prompt change, every model upgrade, every retrieval tweak. Track pass rate and latency over time.
Tools we use: Braintrust, LangSmith, or a homegrown harness with PostHog logging the results. Honeycomb if you want OpenTelemetry-native tracing.
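If you go the homegrown route, the core of a harness is small. The sketch below assumes your system can be called as a function from prompt to output string, and that each case supplies a `check` callable encoding the expected behavior. `EvalCase` and `run_evals` are illustrative names, not part of any of the tools above.

```python
import time
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalCase:
    name: str
    prompt: str
    check: Callable[[str], bool]  # returns True when the output is acceptable

def run_evals(system: Callable[[str], str], cases: list) -> dict:
    """Run every case, recording pass/fail and latency. A crash counts as a fail."""
    results = []
    for case in cases:
        start = time.perf_counter()
        try:
            passed = bool(case.check(system(case.prompt)))
        except Exception:
            passed = False
        latency_ms = (time.perf_counter() - start) * 1000
        results.append({"name": case.name, "passed": passed, "latency_ms": latency_ms})
    pass_rate = sum(r["passed"] for r in results) / len(results) if results else 0.0
    return {"pass_rate": pass_rate, "results": results}
```

Store the returned dict alongside the prompt version and model name on every run, and the pass-rate-over-time chart comes for free.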
Voice agents — the hard mode
Voice is where most teams underestimate the engineering work by 10x. Text RAG that takes 3 seconds to respond feels fine. A voice agent that takes 3 seconds feels broken.
Target latency budget for a voice agent: under 800ms from end-of-speech to start-of-response. Under 500ms feels natural. Over 1.2s and users hang up.
That budget gets eaten fast:
- Speech-to-text: 100-200ms (Deepgram Nova-3, AssemblyAI Universal-2)
- LLM first-token latency: 200-500ms (Claude Haiku, GPT-4.1 Mini, or a small fine-tuned model)
- Text-to-speech first-byte: 100-300ms (ElevenLabs Turbo, Cartesia, OpenAI tts-1)
- Network and orchestration overhead: 50-200ms
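Adding up the ranges above shows how little slack there is. The snippet only does arithmetic on the figures already listed; the stage keys are our own labels.

```python
# (best case, worst case) in milliseconds, taken from the list above.
STAGES_MS = {
    "speech_to_text": (100, 200),
    "llm_first_token": (200, 500),
    "tts_first_byte": (100, 300),
    "network_and_orchestration": (50, 200),
}
TARGET_MS = 800

best_case = sum(low for low, _ in STAGES_MS.values())     # 450
worst_case = sum(high for _, high in STAGES_MS.values())  # 1200

headroom_best = TARGET_MS - best_case    # 350 ms to spare
headroom_worst = TARGET_MS - worst_case  # -400 ms: over budget
```

With every stage at its best you have 350ms of headroom. With every stage at its worst you land at 1.2s, which is exactly the point where users start hanging up.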
The infra companies that solve most of this for you in 2026: Vapi, Retell, LiveKit Agents, and Pipecat. Pick one. Build only the business logic.
The non-obvious things that matter more than model choice for voice:
- Interruption handling. Real conversations interrupt. If your agent can’t be cut off mid-sentence, it feels robotic.
- Endpoint detection. Knowing when the user stopped talking. VAD (voice activity detection) needs tuning per use case. Customer support tolerates more pause time than emergency dispatch.
- Streaming everything. STT streams into the LLM, the LLM streams into TTS, and TTS streams out as audio. Anywhere you wait for a full response is a place where the latency budget dies.
- Fallbacks. When the LLM stalls, the agent says “let me check on that” while you queue up a retry in the background. Silence is the worst user experience.
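The fallback behavior can be sketched with plain asyncio. This is a simplified variant under stated assumptions: `generate` and `speak` are hypothetical async callables standing in for your LLM and TTS layers, the 0.6s stall threshold is a made-up number, and the sketch keeps the original request in flight instead of queuing a retry.

```python
import asyncio

async def respond_with_filler(generate, speak, stall_after_s=0.6,
                              filler="Let me check on that."):
    """Speak a filler line if the reply is not ready within stall_after_s seconds."""
    task = asyncio.create_task(generate())
    done, _pending = await asyncio.wait({task}, timeout=stall_after_s)
    if not done:
        # The model is stalling: fill the silence, keep the request running.
        await speak(filler)
    reply = await task
    await speak(reply)
    return reply
```

A production version would also cap the total wait and swap in a retry or a smaller model if the first request never returns.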
We’ve shipped voice agents for sales qualification, appointment scheduling, and clinical intake. The ones that work treat voice as a real-time systems problem, not an AI problem.
LLM ops — the boring stuff that decides if you survive
The teams whose AI products are still alive in month 12 share a few habits.
Cost ceilings, not cost guesses
Every LLM call gets a token budget. Every user gets a daily ceiling. Every endpoint gets a circuit breaker. The default behavior when a budget is blown: degrade gracefully (smaller model, cached response, “we’re seeing high volume, try again in a minute”). Never silently keep burning.
We use Helicone or Langfuse to track per-user, per-feature spend. A weekly report goes to the team. Surprises in the OpenAI bill stop showing up.
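The enforcement side is separate from the tracking side, and it fits in a few lines. This is a minimal in-memory sketch: `SpendGuard` and its decision strings are illustrative names, the daily reset is left out, and a real deployment would keep the counters in Redis or your database.

```python
from collections import defaultdict

class SpendGuard:
    """Per-call token budget plus per-user daily ceiling (daily reset not shown)."""

    def __init__(self, daily_user_ceiling_usd, max_tokens_per_call):
        self.daily_user_ceiling_usd = daily_user_ceiling_usd
        self.max_tokens_per_call = max_tokens_per_call
        self.spent_today = defaultdict(float)

    def decide(self, user_id, estimated_tokens, estimated_cost_usd):
        """Return 'allow', 'degrade', or 'over_call_budget' before making the call."""
        if estimated_tokens > self.max_tokens_per_call:
            return "over_call_budget"  # e.g. truncate the input or refuse
        if self.spent_today[user_id] + estimated_cost_usd > self.daily_user_ceiling_usd:
            return "degrade"           # smaller model, cached answer, or retry-later
        return "allow"

    def record(self, user_id, actual_cost_usd):
        """Call after every completed request with the real cost."""
        self.spent_today[user_id] += actual_cost_usd
```

The check runs before the call and the recording runs after it, so a user can overshoot the ceiling by at most one request.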
Caching the cacheable
Prompt prefix caching (Anthropic, OpenAI, and Gemini all support it in 2026) cuts costs 50-90% on RAG workloads where the system prompt and retrieved chunks repeat. Turn it on.
Semantic caching of full responses for FAQ-style queries: another 30-60% saved. Redis with a vector index, or GPTCache.
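The mechanism behind semantic caching, reduced to an in-memory sketch: store each answered query's embedding with its response, and serve the stored response when a new query's embedding is close enough. The embeddings are assumed to come from whatever embedding model you already use, the 0.95 threshold is an assumption you would tune on your own traffic, and the linear scan stands in for the vector index.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

class SemanticCache:
    def __init__(self, threshold=0.95):
        self.threshold = threshold
        self.entries = []  # list of (embedding, response) pairs

    def put(self, embedding, response):
        self.entries.append((list(embedding), response))

    def get(self, embedding):
        """Return the cached response for the nearest stored query, or None on a miss."""
        best = max(self.entries, key=lambda entry: cosine(embedding, entry[0]),
                   default=None)
        if best is not None and cosine(embedding, best[0]) >= self.threshold:
            return best[1]
        return None
```

Set the threshold too low and users get answers to questions they did not ask, so add cache hits to your eval set before turning this on.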
Model routing
Not every query needs the frontier model. We route by complexity: simple intent classification to Haiku or GPT-4.1 Mini, complex reasoning to Sonnet or GPT-5, the heaviest analytical work to Opus or o3. A router (custom or OpenRouter) saves 60-80% of spend on most products.
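A custom router can start out very small. In the sketch below everything is a placeholder: the tier names, the model identifiers in `ROUTES`, and especially the keyword heuristic in `classify_complexity`, which in practice would be a small classifier model, not a word list.

```python
# Placeholder model identifiers; substitute the models you actually deploy.
ROUTES = {
    "simple": "small-fast-model",
    "complex": "mid-tier-model",
    "heavy": "frontier-model",
}

HEAVY_MARKERS = ("analyze", "compare", "forecast")  # illustrative only

def classify_complexity(query: str) -> str:
    """Toy heuristic standing in for a real complexity classifier."""
    words = query.lower().split()
    if any(marker in words for marker in HEAVY_MARKERS):
        return "heavy"
    if len(words) > 20 or "why" in words:
        return "complex"
    return "simple"

def route(query: str) -> str:
    """Pick the model for a query based on its estimated complexity tier."""
    return ROUTES[classify_complexity(query)]
```

Log the chosen tier with every call so your evals can tell you when the router is sending hard questions to the cheap model.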
Observability
Every LLM call logs: prompt, response, latency, tokens, cost, user, feature, model. We tag every prompt with a version. When quality drops, we know exactly which change caused it.
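As a sketch of what one such log record might look like, here is a dataclass with the fields listed above plus the prompt version, serialized as a JSON line. The field names and the split of tokens into input and output are our assumptions; adapt them to whatever your logging backend expects.

```python
import json
from dataclasses import asdict, dataclass

@dataclass
class LLMCallRecord:
    prompt: str
    response: str
    latency_ms: float
    input_tokens: int
    output_tokens: int
    cost_usd: float
    user: str
    feature: str
    model: str
    prompt_version: str  # bump on every prompt change

def to_log_line(record: LLMCallRecord) -> str:
    """Serialize one call as a single JSON line, ready for any log pipeline."""
    return json.dumps(asdict(record), sort_keys=True)
```

With `prompt_version` on every line, finding the change that caused a quality drop becomes a group-by query.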
Pricing reality
From $15K. 3-8 weeks.
What that buys at the low end: a production RAG over your data with hybrid retrieval, evals, guardrails, and observability. Deployed, monitored, handed off.
At the higher end: a vertical agent or voice agent with multi-tool workflows, state machines, eval harness, and 60 days of post-launch tuning.
What it does not buy: a “we’ll figure out the use case as we build” project. Bring a real workflow you want automated, or a real question your users keep asking. AI without a target is the most expensive way to spend money in 2026.
What we won’t build
We turn down AI projects when:
- The user already has 4 AI vendors and is on the 5th. The problem isn’t the model.
- The “use case” is “we need AI in the product.” That’s a board ask, not a problem.
- The data isn’t ready. We’d rather spend 4 weeks on data pipelines first than ship a hallucinating product.
Ready to ship AI that doesn’t break Monday?
Tell us the workflow you want automated. We’ll come back with: feasibility, recommended architecture, eval plan, and a fixed price. Free 30-minute call.
Book a call about AI or read how we vet engineers before you trust them with your production AI stack.