Building a RAG System in Production: Lessons Learned

RAG (Retrieval-Augmented Generation) is the dominant pattern for giving an LLM access to your data without fine-tuning. A RAG POC can be built in a day. A reliable production RAG system takes weeks. Here are the problems you'll encounter and how to anticipate them.

What RAG Solves — And What It Doesn't

RAG addresses two problems, knowledge cutoff and proprietary data: your LLM doesn't know your internal documents, emails, or knowledge base, nor anything published after its training cutoff. RAG gives it access to these sources at inference time.

RAG does not solve hallucinations. An LLM can ignore retrieved sources, misinterpret them, or fabricate information between provided excerpts. This is a distinct problem requiring specific evaluation and guardrails.

The 5 Decisions That Define Your RAG Quality

1. Chunking strategy

Splitting your documents into chunks is the most impactful and least discussed decision. Chunks that are too small lose the surrounding context. Chunks that are too large exceed the context window or dilute relevance.

What works in production: semantic chunking (split at natural boundaries of meaning, not at a fixed 512-token count), hierarchical chunking (store both the chunk and its parent summary), and overlapping chunks (each chunk includes the last 100 tokens of the previous chunk).
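As an illustration of the overlapping-chunk idea only, here is a minimal sketch. It uses a fixed-size sliding window, not semantic chunking, and it assumes the text has already been tokenized into a list. The function name chunk_with_overlap and the default sizes are hypothetical; in a real pipeline the tokens would come from your embedding model's tokenizer.

```python
def chunk_with_overlap(tokens, chunk_size=400, overlap=100):
    """Split a token list into fixed-size chunks.

    Each chunk after the first starts with the last `overlap` tokens
    of the previous chunk. Sizes are illustrative assumptions.
    """
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once a chunk has reached the end of the document.
        if start + chunk_size >= len(tokens):
            break
    return chunks


if __name__ == "__main__":
    words = "one two three four five six seven eight nine ten".split()
    for chunk in chunk_with_overlap(words, chunk_size=4, overlap=2):
        print(chunk)
```

A larger overlap costs more storage and embedding calls, so the 100-token figure is best treated as a starting point to tune against your evaluation set.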

2. Embedding model

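Embeddings transform your chunks into vectors, and the choice of model determines semantic search quality. In 2026, multilingual embedding models (such as OpenAI's text-embedding-3-large or the multilingual E5 models published by intfloat) generally outperform older models on complex business tasks. Important: use the same embedding model at indexing time and at query time, because vectors produced by different models are not comparable. Changing models means re-embedding the whole corpus.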
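One way to enforce the "same model at indexing and query time" rule is to store the model name alongside the index and check it before every search. The sketch below assumes a hypothetical IndexMetadata record and a hypothetical check_query_embedding helper; it does not reflect the API of any particular vector database.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexMetadata:
    """Hypothetical record saved next to the vector index at build time."""
    embedding_model: str
    dimensions: int


def check_query_embedding(meta, query_model, query_vector):
    """Raise if the query was embedded with a different model than the index."""
    if query_model != meta.embedding_model:
        raise ValueError(
            f"index built with {meta.embedding_model!r}, "
            f"query embedded with {query_model!r}"
        )
    # A dimension mismatch is a second, cheaper signal of the same mistake.
    if len(query_vector) != meta.dimensions:
        raise ValueError(
            f"expected {meta.dimensions} dimensions, got {len(query_vector)}"
        )


if __name__ == "__main__":
    meta = IndexMetadata(embedding_model="model-a", dimensions=4)
    check_query_embedding(meta, "model-a", [0.1, 0.2, 0.3, 0.4])
    print("query embedding is compatible with the index")
```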

3. Retrieval strategy

Pure vector search (cosine similarity) isn't enough on its own. Hybrid approaches that combine vector search with lexical search (BM25) systematically outperform vector-only search on business benchmarks, because lexical search catches exact terms such as product codes and acronyms that embeddings can blur. Re-ranking (passing the top results through a cross-encoder model) improves final quality significantly further.
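The article does not say how the two result lists should be combined. One common option is reciprocal rank fusion (RRF), sketched below as an assumption rather than as the method used here. The function takes ranked lists of chunk ids (best first), for example one from vector search and one from BM25, and the constant k=60 is the value usually quoted for RRF.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of ids into one list, best first.

    Each id scores 1 / (k + rank) per list it appears in; ids that rank
    well in several lists accumulate the highest totals.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by descending score, breaking ties by id for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


if __name__ == "__main__":
    vector_hits = ["a", "b", "c"]  # hypothetical ids from vector search
    bm25_hits = ["b", "d", "a"]    # hypothetical ids from BM25
    print(reciprocal_rank_fusion([vector_hits, bm25_hits]))
```

The fused list would then be truncated and passed to the cross-encoder re-ranker, which is slower per pair and therefore best applied to a small candidate set.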

4. Context presentation in the prompt

How you present retrieved sources to the LLM directly impacts response quality. Best practices: explicitly indicate the provenance of each excerpt, ask the model to cite its sources, and write a system prompt that tells the model what to do when the retrieved information is insufficient (for example, say that it does not know).
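These three practices can be sketched as a small prompt builder. The wording of the instructions, the build_prompt name, and the "source"/"text" keys are illustrative assumptions, not a recommended template.

```python
def build_prompt(question, excerpts):
    """Format retrieved excerpts with numbered provenance and citation rules.

    `excerpts` is a list of dicts with hypothetical keys "source" and "text".
    """
    instructions = (
        "Answer using only the numbered excerpts below. "
        "Cite the excerpt number after each claim, for example [1]. "
        "If the excerpts do not contain the answer, say that you do not know."
    )
    blocks = []
    for i, excerpt in enumerate(excerpts, start=1):
        # Provenance is stated explicitly next to each excerpt.
        blocks.append(f"[{i}] Source: {excerpt['source']}\n{excerpt['text']}")
    context = "\n\n".join(blocks)
    return f"{instructions}\n\nExcerpts:\n{context}\n\nQuestion: {question}"


if __name__ == "__main__":
    print(build_prompt(
        "What is the refund period?",
        [
            {"source": "policy.pdf", "text": "Refunds are accepted within 30 days."},
            {"source": "faq.md", "text": "Gift cards are not refundable."},
        ],
    ))
```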

5. Continuous evaluation

A RAG system without evaluation is a system that silently degrades. Set up measurable metrics from day 1: faithfulness, answer relevancy, and context recall. Frameworks like RAGAS automate this evaluation.
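To make "measurable from day 1" concrete, here is a deliberately simplified sketch of a retrieval check. It is not the RAGAS implementation: it approximates context recall as the share of hand-labeled relevant chunk ids that the retriever returned. The evaluation-set format, the retrieve callable, and the 0.8 threshold are all assumptions.

```python
def context_recall(retrieved_ids, relevant_ids):
    """Fraction of labeled relevant chunk ids present in the retrieved ids."""
    relevant = set(relevant_ids)
    if not relevant:
        return 1.0  # nothing to find, so nothing was missed
    return len(relevant & set(retrieved_ids)) / len(relevant)


def evaluate_retrieval(eval_set, retrieve, threshold=0.8):
    """Run `retrieve(question)` over the set and report mean recall."""
    scores = [
        context_recall(retrieve(example["question"]), example["relevant_ids"])
        for example in eval_set
    ]
    mean = sum(scores) / len(scores) if scores else 0.0
    return {"mean_context_recall": mean, "passed": mean >= threshold}


if __name__ == "__main__":
    eval_set = [
        {"question": "q1", "relevant_ids": ["c1", "c2"]},
        {"question": "q2", "relevant_ids": ["c3"]},
    ]
    # Stand-in retriever for the example; a real one would query the index.
    fake_index = {"q1": ["c1", "c9"], "q2": ["c3"]}
    print(evaluate_retrieval(eval_set, lambda q: fake_index[q]))
```

Running a check like this in CI on every change to chunking, embeddings, or retrieval is one way to catch silent degradation early. Faithfulness and answer relevancy need a judge model or human labels and are not covered by this sketch.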

Problems You Don't See in the POC

Latency: In production, the costs of retrieval over millions of chunks, re-ranking, and LLM calls add up. Measure each stage and set latency SLAs during development, not after launch.
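A minimal sketch of per-stage measurement, assuming a hypothetical timed context manager and per-stage budgets in seconds. The stage names and budget values are placeholders, and time.sleep stands in for the real calls.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(stage, timings):
    """Add the wall-clock duration of the enclosed block to timings[stage]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        timings[stage] = timings.get(stage, 0.0) + elapsed


def over_budget(timings, budgets_s):
    """Return the stages whose measured time exceeds their budget."""
    return {
        stage: seconds
        for stage, seconds in timings.items()
        if seconds > budgets_s.get(stage, float("inf"))
    }


if __name__ == "__main__":
    timings = {}
    with timed("retrieval", timings):
        time.sleep(0.02)  # placeholder for the vector and BM25 queries
    with timed("rerank", timings):
        time.sleep(0.01)  # placeholder for the cross-encoder pass
    budgets = {"retrieval": 0.2, "rerank": 0.3}  # assumed values
    print(timings, over_budget(timings, budgets))
```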

Situational hallucinations: Your RAG system may work perfectly on 95% of questions and still hallucinate on specific edge cases. A representative evaluation set must cover these hard cases.

Data freshness: Your vector index goes stale as source documents are added, edited, or deleted. Define your reindexing policy upfront.
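One possible reindexing policy is incremental: hash each source document and re-embed only what changed. The sketch below assumes documents are available as an id-to-text mapping and that the hash stored at the last indexing run is kept per document. The names plan_reindex and content_hash are hypothetical.

```python
import hashlib


def content_hash(text):
    """Stable fingerprint of a document's content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_reindex(current_docs, indexed_hashes):
    """Compare current documents with the hashes stored at indexing time.

    Returns (to_upsert, to_delete): ids that are new or changed, and ids
    that were indexed but no longer exist in the source.
    """
    current = {doc_id: content_hash(text) for doc_id, text in current_docs.items()}
    to_upsert = sorted(
        doc_id for doc_id, digest in current.items()
        if indexed_hashes.get(doc_id) != digest
    )
    to_delete = sorted(doc_id for doc_id in indexed_hashes if doc_id not in current)
    return to_upsert, to_delete


if __name__ == "__main__":
    indexed = {"a": content_hash("old text"), "b": content_hash("same"), "c": content_hash("gone")}
    docs = {"a": "new text", "b": "same", "d": "brand new"}
    print(plan_reindex(docs, indexed))
```

Deletions matter as much as updates: a chunk from a removed document can still be retrieved and cited until it is purged from the index.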

Security and access control: In production, not all users should have access to all documents. Chunk-level authorization is complex and often omitted in POCs.
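As a sketch of what chunk-level authorization can look like, assume each chunk carries an allowed_groups list copied from its source document at indexing time (a hypothetical schema). Chunks the user may not see are dropped before anything reaches the prompt.

```python
def authorized_chunks(chunks, user_groups):
    """Keep only chunks that share at least one group with the user.

    Each chunk is a dict with a hypothetical "allowed_groups" list.
    Chunks with no allowed groups are treated as visible to nobody.
    """
    groups = set(user_groups)
    return [
        chunk for chunk in chunks
        if groups & set(chunk.get("allowed_groups", []))
    ]


if __name__ == "__main__":
    chunks = [
        {"id": "c1", "text": "Salary bands", "allowed_groups": ["hr"]},
        {"id": "c2", "text": "Holiday policy", "allowed_groups": ["hr", "staff"]},
        {"id": "c3", "text": "Draft memo", "allowed_groups": []},
    ]
    visible = authorized_chunks(chunks, ["staff"])
    print([chunk["id"] for chunk in visible])
```

Filtering after retrieval, as done here, can leave too few results when most top hits are forbidden. Where the vector store supports metadata filters, applying the same rule inside the query is usually the safer design. Permissions also need to be refreshed when they change on the source document.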

The Valtieri team designs production RAG systems for critical business contexts. Contact us for a scoping call.
