AI & Automation · 10 min read

Building a Production-Ready RAG Pipeline: Lessons From Real Deployments

RAG demos are easy. Production RAG is hard. Here's what we learned building AI document pipelines for legal and enterprise clients — including the mistakes we made and how we fixed them.

Kavya Rao

AI & Automation Engineer · 3 September 2025


Why Production RAG Is Different

Every AI demo uses the same RAG pipeline: chunk documents, embed them, store in a vector DB, retrieve on query, pass to LLM, return answer.
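That demo pipeline fits in a few lines. The sketch below is a self-contained toy: `embed` and `similarity` are word-overlap stand-ins for a real embedding model and cosine similarity, and there is no actual LLM call at the end.

```python
# Toy sketch of the demo-style pipeline: chunk -> embed -> retrieve.
# embed() and similarity() are word-overlap stand-ins for a real
# embedding model and cosine similarity, so the example runs offline.

def chunk(text, size=10):
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def embed(text):
    return set(text.lower().split())       # stand-in for a dense vector

def similarity(a, b):
    return len(a & b) / (len(a | b) or 1)  # Jaccard, stand-in for cosine

def retrieve(query, chunks, k=1):
    q = embed(query)
    return sorted(chunks, key=lambda c: similarity(q, embed(c)), reverse=True)[:k]

doc = ("Either party may terminate this agreement with thirty days written "
       "notice. Confidential information must not be disclosed to third "
       "parties. Payment is due within fourteen days of invoice.")

context = retrieve("which party may terminate the agreement", chunk(doc))
# context[0] is the termination chunk, which would be passed to the LLM
```

Every piece here is what breaks at scale, which is what the rest of this post is about.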

The demo works. The production system breaks in ways you don't expect.

Here's what we learned building RAG for a legal document analysis platform that processes 800+ contracts per day.

Mistake 1: Fixed Chunk Sizes

The classic tutorial says chunk at 512 tokens with 50-token overlap. This works for Wikipedia. It does not work for contracts.

Legal documents have structure: definitions sections, clause hierarchies, cross-references. A 512-token chunk that splits a definition from its context produces hallucinated answers.

Fix: Use semantic chunking. Split on meaningful boundaries (paragraphs, clauses) rather than token counts. LangChain's SemanticChunker is a good starting point, but we ended up writing a domain-specific chunker that respected legal document structure.
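Our production chunker is domain-specific, but the core idea — split on paragraph boundaries and merge small paragraphs under a word budget instead of cutting at a fixed token count — can be sketched like this (a simplified illustration, not our production code):

```python
def semantic_chunks(text, max_words=300):
    """Split on paragraph boundaries, merging small paragraphs until the
    word budget is hit, so a definition stays with its surrounding context
    instead of being cut mid-thought at a fixed token count."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current, count = [], [], 0
    for para in paragraphs:
        words = len(para.split())
        if current and count + words > max_words:
            chunks.append("\n\n".join(current))  # flush at a clean boundary
            current, count = [], 0
        current.append(para)
        count += words
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

Because every chunk boundary falls between paragraphs, a clause and its defined terms travel together — the property the fixed 512-token splitter destroys.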

Mistake 2: Cosine Similarity Is Not Enough

Pure vector similarity retrieves semantically similar text. But "what is the termination clause?" and "what happens if either party terminates?" retrieve different chunks — even though the answer is in the same section.

Fix: Hybrid search. Combine dense vector retrieval (semantic) with sparse BM25 retrieval (keyword). Reciprocal Rank Fusion (RRF) merges the results. We use pgvector for dense + PostgreSQL full-text search for sparse — no additional infrastructure needed.
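RRF itself is only a few lines: each ranked list contributes 1/(k + rank) per chunk, and the constant k (60 in the original RRF paper) damps the dominance of top ranks. A minimal sketch, with hypothetical chunk IDs standing in for real pgvector and full-text results:

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of chunk IDs into one.
    Each list contributes 1 / (k + rank) per chunk; k=60 is the
    constant from the original RRF paper."""
    scores = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense = ["c3", "c1", "c7"]   # vector-search ranking
sparse = ["c1", "c9", "c3"]  # BM25 / full-text ranking
fused = reciprocal_rank_fusion([dense, sparse])
# fused == ["c1", "c3", "c9", "c7"] — c1 wins by ranking highly in both lists
```

In our setup the two input rankings come from a pgvector similarity query and a PostgreSQL full-text query; the fusion happens in application code exactly like this.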

Mistake 3: No Evaluation Pipeline

How do you know if your RAG is getting better or worse after a change? Most teams don't know. They deploy a change, get a user complaint a week later, and scramble.

Fix: Build a regression test suite before you go to production. We use LangSmith with a set of 50 question/answer pairs from real documents. Every deploy runs this suite. If accuracy drops below 90%, the deploy is blocked.

from langsmith import evaluate

results = evaluate(
    rag_pipeline,
    data="contract-qa-v1",
    evaluators=["qa", "context_precision"],
)

Mistake 4: Ignoring Latency

GPT-4o with a 4,000-token prompt takes 8–12 seconds to respond. That's unusable for a document review tool where lawyers query hundreds of documents per day.

Fix:

  • Use streaming responses so users see output immediately
  • Cache embeddings (documents don't change often)
  • Use a smaller model (GPT-4o-mini) for initial triage, escalate to GPT-4o only for complex queries

We cut average response time from 11s to 2.4s with these changes.
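The embedding cache is the cheapest of those wins: key on a hash of the chunk text and skip the API call when the same text is re-ingested. A minimal sketch with a dict standing in for Redis (`embed_fn` is a placeholder for the real embedding call):

```python
import hashlib

_cache = {}  # stand-in for Redis; keys survive re-ingestion of unchanged docs

def cached_embedding(text, embed_fn):
    """Return a cached embedding when this exact chunk text was seen before.
    embed_fn is a placeholder for the real embedding call
    (text-embedding-3-large in our stack)."""
    key = hashlib.sha256(text.encode("utf-8")).hexdigest()
    if key not in _cache:
        _cache[key] = embed_fn(text)
    return _cache[key]

calls = []
def fake_embed(text):
    calls.append(text)         # count simulated API calls
    return [float(len(text))]  # dummy vector

v1 = cached_embedding("clause 4.2: termination", fake_embed)
v2 = cached_embedding("clause 4.2: termination", fake_embed)
# the second lookup hits the cache, so fake_embed runs only once
```

Since documents change rarely, the hit rate is high and almost all embedding cost disappears after first ingestion.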

What a Production RAG Stack Looks Like

  • Document ingestion: Python + Unstructured.io
  • Chunking: Custom semantic chunker
  • Embeddings: OpenAI text-embedding-3-large (cached in Redis)
  • Vector store: pgvector (PostgreSQL)
  • Retrieval: Hybrid dense + BM25 with RRF
  • LLM: GPT-4o with streaming
  • Evaluation: LangSmith
  • Orchestration: LangGraph
  • API: FastAPI

The Result

Our legal client went from 4-hour contract reviews to 12-minute reviews with 96.4% accuracy. The remaining 3.6% failure rate is caught by a human review step we built into the workflow.

Production AI is an engineering discipline, not a prompt engineering exercise.

Talk to our AI team if you're building a similar system.

RAG · LangChain · LLM · Python · AI · Vector Database · OpenAI