LangChain in Production: Building a Reliable RAG Pipeline

Go beyond the LangChain hello-world tutorial. Learn how to build a production-ready RAG pipeline with proper chunking strategies, retrieval optimization, evaluation, and monitoring.

15 min read

Beyond the Demo

Most LangChain tutorials stop at the hello-world stage: load a document, split it, stuff it into a vector store, and ask a question. That's fine for a demo, but it falls apart when you hit production.

In this guide, we'll build a RAG pipeline that actually works in production. We'll cover five things most tutorials skip:

- Document chunking strategies that preserve semantic boundaries
- Retrieval optimization with hybrid search and re-ranking
- Evaluation: measuring whether your RAG is actually good
- Observability with LangSmith tracing
- Error handling and fallbacks for real-world robustness

Chunking Strategy: Why Naive Splitting Fails

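A very common mistake in RAG is using RecursiveCharacterTextSplitter on its own with an arbitrary chunk size. It knows nothing about document structure, so whenever a section is longer than the chunk size it can split mid-sentence, mid-table, or mid-code-block.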

from langchain.text_splitter import (
    RecursiveCharacterTextSplitter,
    MarkdownHeaderTextSplitter,
)

# Better: semantic chunking with markdown awareness
headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
    ("###", "Header 3"),
]

markdown_splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=headers_to_split_on
)

# Then further split large sections, preserving context
char_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    separators=["\n\n", "\n", ".", " ", ""],
)

# Full pipeline (markdown_content is the raw markdown text of one document)
raw_docs = markdown_splitter.split_text(markdown_content)
chunks = char_splitter.split_documents(raw_docs)

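This approach respects the document's natural structure. Sections that fit within chunk_size stay whole, so most tables, code blocks, and paragraphs stay intact. Sections longer than chunk_size are still split by the character splitter, which tries paragraph and line breaks first, so check long tables and code blocks on your own documents.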
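A quick way to check chunk boundaries on your own documents is a small audit pass over the chunk texts. The sketch below is a heuristic, not a LangChain API: `audit_chunk` and `audit_chunks` are hypothetical helper names, and the three checks (odd number of code fences, table chunk without a header separator, last line without terminal punctuation) are assumptions about what "broken" looks like in markdown.

```python
FENCE = "`" * 3  # markdown code fence marker
SENTENCE_ENDINGS = (".", "!", "?", ":", ";", ")", "]", '"', "'")


def _is_table_separator(line: str) -> bool:
    """True for a markdown table separator row such as |---|:--:|."""
    return line.startswith("|") and set(line[1:]) <= set(" :|-") and "-" in line


def audit_chunk(text: str) -> list[str]:
    """Return the boundary problems found in one chunk of markdown text."""
    stripped = text.strip()
    if not stripped:
        return ["empty chunk"]
    problems = []
    lines = stripped.splitlines()
    # An odd number of fence markers means a code block was cut in half.
    if stripped.count(FENCE) % 2 == 1:
        problems.append("split code block")
    # A chunk that opens on a table row with no separator row after it
    # has probably lost the table header.
    first = lines[0].strip()
    if first.startswith("|"):
        second = lines[1].strip() if len(lines) > 1 else ""
        if not _is_table_separator(second):
            problems.append("table missing header row")
    # Prose whose last line lacks terminal punctuation was likely cut mid-sentence.
    last = lines[-1].strip()
    skip_sentence_check = (
        "split code block" in problems
        or last.startswith("|")
        or last.endswith(FENCE)
    )
    if not skip_sentence_check and not last.endswith(SENTENCE_ENDINGS):
        problems.append("ends mid-sentence")
    return problems


def audit_chunks(chunks: list[str]) -> dict[int, list[str]]:
    """Map chunk index to its problems, skipping clean chunks."""
    report = {}
    for index, text in enumerate(chunks):
        problems = audit_chunk(text)
        if problems:
            report[index] = problems
    return report
```

With LangChain documents you would pass `[doc.page_content for doc in chunks]`. Expect some false positives (headings, list items); the point is to surface chunks worth reading by hand.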

Choosing the Right Embeddings

Not all embedding models are created equal. Here's a quick cheat sheet for production:

from langchain_openai import OpenAIEmbeddings
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_cohere import CohereEmbeddings

# Pick one of the options below.

# Option 1: OpenAI (hosted, strong quality; at the time of writing,
# text-embedding-3-small lists at about $0.02/1M tokens and
# text-embedding-3-large at about $0.13/1M tokens)
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

# Option 2: Open-source (free, good for sensitive data)
embeddings = HuggingFaceEmbeddings(
    model_name="BAAI/bge-small-en-v1.5",
    encode_kwargs={"normalize_embeddings": True},
)

# Option 3: Cohere (built for search and retrieval use cases)
embeddings = CohereEmbeddings(model="embed-english-v3.0")

For production, I recommend choosing and tuning your embedding setup against a small evaluation dataset rather than relying on defaults. Even 50 curated query-document pairs are enough to compare models and chunk sizes, and to catch regressions in retrieval quality before your users do.
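To make that concrete, here is a minimal sketch of scoring a retriever against curated query-document pairs with recall@k. The names `recall_at_k` and `mean_recall_at_k` are hypothetical, and `retrieve` is assumed to be any callable that returns a ranked list of document ids for a query (for example, a thin wrapper around your retriever that pulls an id out of each document's metadata).

```python
from typing import Callable, Iterable


def recall_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """Fraction of the relevant ids that appear in the top-k retrieved ids."""
    if not relevant_ids:
        raise ValueError("each evaluation query needs at least one relevant id")
    top_k = set(retrieved_ids[:k])
    return len(top_k & relevant_ids) / len(relevant_ids)


def mean_recall_at_k(
    retrieve: Callable[[str], list[str]],
    eval_set: Iterable[tuple[str, set[str]]],
    k: int = 5,
) -> float:
    """Average recall@k over (query, relevant_ids) pairs."""
    scores = [recall_at_k(retrieve(query), relevant, k) for query, relevant in eval_set]
    if not scores:
        raise ValueError("eval_set is empty")
    return sum(scores) / len(scores)


# Toy example: a dict stands in for a real retriever.
fake_index = {
    "refund policy": ["doc-1", "doc-9"],
    "reset password": ["doc-2", "doc-7", "doc-8"],
}
eval_set = [
    ("refund policy", {"doc-1"}),
    ("reset password", {"doc-2", "doc-3"}),
]
score = mean_recall_at_k(fake_index.__getitem__, eval_set, k=5)
```

Run the same evaluation set against each embedding model or chunk size you are considering and keep the configuration with the best score.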

Hybrid Search: Vector + Keyword

Vector search alone misses exact matches. Keyword search alone misses semantic matches. Use both.

from langchain_community.retrievers import BM25Retriever
from langchain.retrievers import EnsembleRetriever

# Vector retriever (vectorstore is your already-populated vector store)
vector_retriever = vectorstore.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 10},
)

# Keyword retriever
keyword_retriever = BM25Retriever.from_documents(chunks)
keyword_retriever.k = 10

# Ensemble with configurable weights
ensemble_retriever = EnsembleRetriever(
    retrievers=[vector_retriever, keyword_retriever],
    weights=[0.7, 0.3],
)

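The ensemble approach gives you the best of both worlds. A 70/30 split in favor of vector search is a reasonable starting point, but treat it as a parameter to tune against your evaluation set. If your documents are code-heavy or full of exact identifiers, shift toward keyword. If they're prose-heavy, lean into vector.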
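To see what the weights actually do, it helps to look at the fusion step in isolation. As far as I know, EnsembleRetriever merges the ranked lists with weighted Reciprocal Rank Fusion; the sketch below is my own minimal version of that technique, not LangChain's code. The function name `weighted_rrf` is hypothetical and the constant `c=60` is the value commonly used for RRF.

```python
from collections import defaultdict


def weighted_rrf(rankings: list[list[str]], weights: list[float], c: int = 60) -> list[str]:
    """Fuse several ranked lists of document ids into one ranking.

    Each document scores weight / (rank + c) per list it appears in,
    with rank starting at 1; scores are summed across lists.
    """
    if len(rankings) != len(weights):
        raise ValueError("need exactly one weight per ranking")
    scores: dict[str, float] = defaultdict(float)
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += weight / (rank + c)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


vector_hits = ["a", "b", "c"]   # ranked by embedding similarity
keyword_hits = ["c", "a", "d"]  # ranked by BM25

vector_heavy = weighted_rrf([vector_hits, keyword_hits], weights=[0.7, 0.3])
keyword_heavy = weighted_rrf([vector_hits, keyword_hits], weights=[0.3, 0.7])
```

Documents that appear in both lists rise to the top, and flipping the weights changes which of them wins. That is the effect you are tuning when you move away from 70/30.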

Re-Ranking: The Missing Piece

Retrieval returns the top-k chunks by similarity. But not all of them are actually relevant. A cross-encoder re-ranker fixes this.

from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import CrossEncoderReranker
from langchain_community.cross_encoders import HuggingFaceCrossEncoder

# Load a cross-encoder model
cross_encoder = HuggingFaceCrossEncoder(
    model_name="BAAI/bge-reranker-v2-m3"
)

compressor = CrossEncoderReranker(
    model=cross_encoder,
    top_n=3,
)

compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=ensemble_retriever,
)

# Now your LLM only gets the 3 most relevant chunks
results = compression_retriever.invoke("What is the refund policy?")

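Re-ranking typically adds 200-500 ms of latency, depending on the model size, the hardware, and how many candidates you score, but it dramatically improves answer quality. For any production RAG system, I consider it non-negotiable.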
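Since the latency cost depends on your setup, measure it rather than trusting a rule of thumb. The sketch below shows the retrieve-then-rerank pattern with a pluggable scoring function and a timing wrapper. Everything here is illustrative: `rerank` and `timed` are hypothetical helpers, and `token_overlap` is a toy scorer standing in for a real cross-encoder.

```python
import time
from typing import Any, Callable


def token_overlap(query: str, doc: str) -> float:
    """Toy relevance score: share of query tokens that also appear in the doc."""
    query_tokens = set(query.lower().split())
    doc_tokens = set(doc.lower().split())
    return len(query_tokens & doc_tokens) / len(query_tokens) if query_tokens else 0.0


def rerank(
    query: str,
    docs: list[str],
    score: Callable[[str, str], float] = token_overlap,
    top_n: int = 3,
) -> list[str]:
    """Score every candidate against the query and keep the top_n best."""
    # A real cross-encoder would score all (query, doc) pairs in one batch.
    ranked = sorted(docs, key=lambda doc: score(query, doc), reverse=True)
    return ranked[:top_n]


def timed(fn: Callable[..., Any], *args: Any, **kwargs: Any) -> tuple[Any, float]:
    """Call fn and return (result, elapsed milliseconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed_ms = (time.perf_counter() - start) * 1000
    return result, elapsed_ms


candidates = [
    "Shipping takes five days.",
    "The refund policy allows returns within 30 days.",
    "Refund requests need a receipt.",
]
top_docs, elapsed_ms = timed(rerank, "what is the refund policy", candidates, top_n=2)
```

The same `timed` wrapper works around `compression_retriever.invoke`, which lets you compare latency with and without the re-ranker on your own hardware.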

Putting It Together: Production RAG Chain

from operator import itemgetter
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough, RunnableParallel
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o", temperature=0)

template = """You are a helpful assistant answering questions based on the provided context.

Context: {context}

Question: {question}

Instructions:
1. Answer based solely on the context above
2. If the context doesn't contain the answer, say "I cannot find this information in the provided documents"
3. Cite relevant source headers when possible
4. Keep answers concise but complete

Answer:"""

prompt = ChatPromptTemplate.from_template(template)

def format_docs(docs):
    return "\n\n---\n\n".join(doc.page_content for doc in docs)

rag_chain = (
    RunnableParallel(
        {"context": compression_retriever | format_docs, "question": RunnablePassthrough()}
    )
    | prompt
    | llm
    | StrOutputParser()
)

# Usage
response = rag_chain.invoke("What are the requirements for a refund?")
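One detail worth noting: the prompt asks the model to cite source headers, but `format_docs` only passes `page_content`, so the model never sees them. MarkdownHeaderTextSplitter stores each header in the chunk's metadata under the names given in `headers_to_split_on`, so a formatter can prepend them. The sketch below is one possible variant; `format_docs_with_sources` is a hypothetical name, and the `Doc` dataclass is only a stand-in for LangChain's `Document` so the example runs on its own.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Stand-in for a LangChain Document (same two attributes)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


# Metadata keys as configured in headers_to_split_on.
HEADER_KEYS = ("Header 1", "Header 2", "Header 3")


def format_docs_with_sources(docs) -> str:
    """Join chunks, labelling each one with its header path."""
    blocks = []
    for doc in docs:
        path = " > ".join(doc.metadata[key] for key in HEADER_KEYS if key in doc.metadata)
        label = f"[Source: {path}]" if path else "[Source: unknown]"
        blocks.append(f"{label}\n{doc.page_content}")
    return "\n\n---\n\n".join(blocks)


docs = [
    Doc("Refunds are issued within 14 days.", {"Header 1": "Policies", "Header 2": "Refunds"}),
    Doc("No header here."),
]
context = format_docs_with_sources(docs)
```

Swapping this in for `format_docs` in the chain gives the model something concrete to cite.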

Evaluation: Know If Your RAG Is Actually Working

Without evaluation, you're flying blind. Here's a minimal but effective evaluation setup:

from langsmith import Client
from langsmith.evaluation import evaluate

# Log traces to LangSmith (more on this below)
client = Client()

# Define a test dataset
test_questions = [
    "What is the return policy?",
    "How do I reset my password?",
    "What payment methods do you accept?",
]

# Evaluate answer correctness using LLM-as-judge
def correctness_evaluator(run, example):
    # Calls an LLM to judge whether the answer matches the expected answer.
    # Assumptions: the target returns {"output": ...} and each dataset
    # example stores its reference under {"answer": ...}; adjust both keys.
    verdict = llm.invoke(
        "Reply YES if the answer matches the reference, otherwise NO.\n"
        f"Reference: {example.outputs['answer']}\n"
        f"Answer: {run.outputs['output']}"
    ).content
    return {"key": "correctness", "score": int(verdict.strip().upper().startswith("YES"))}
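If you want something you can run before wiring up any platform, a plain-Python harness gets you surprisingly far. The sketch below pairs each question with an expected answer, runs any callable over them, and reports accuracy. All names (`EvalCase`, `run_eval`, `contains_expected`) are hypothetical, and the substring judge is a deliberately crude stand-in for an LLM judge.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    question: str
    expected: str  # reference answer, or a key phrase it must contain


def contains_expected(answer: str, expected: str) -> bool:
    """Crude judge: does the answer mention the expected phrase?"""
    return expected.lower() in answer.lower()


def run_eval(
    answer_fn: Callable[[str], str],
    judge: Callable[[str, str], bool],
    cases: list[EvalCase],
) -> dict:
    """Run answer_fn over every case and score each answer with judge."""
    if not cases:
        raise ValueError("no evaluation cases given")
    results = []
    for case in cases:
        answer = answer_fn(case.question)
        results.append(
            {"question": case.question, "answer": answer, "passed": judge(answer, case.expected)}
        )
    accuracy = sum(result["passed"] for result in results) / len(results)
    return {"accuracy": accuracy, "results": results}


# Toy example: canned answers stand in for rag_chain.invoke.
canned = {
    "What is the return policy?": "Returns are accepted within 30 days.",
    "How do I reset my password?": "Please contact support.",
}
cases = [
    EvalCase("What is the return policy?", "30 days"),
    EvalCase("How do I reset my password?", "reset link"),
]
report = run_eval(canned.get, contains_expected, cases)
```

In the real pipeline you would pass `rag_chain.invoke` as `answer_fn` and swap the judge for an LLM-backed one once the basic loop works.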

For a more structured approach, integrate Ragas:

from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
)
# The module is ragas.llms; recent Ragas releases name the wrapper
# LangchainLLMWrapper (older releases used LangchainLLM).
from ragas.llms import LangchainLLMWrapper

# These four metrics tell you:
# 1. Faithfulness: Did the LLM hallucinate?
# 2. Answer Relevancy: Does the answer actually address the question?
# 3. Context Precision: Were the retrieved chunks relevant?
# 4. Context Recall: Did we retrieve enough relevant context?

Observability with LangSmith

LangSmith is LangChain's official observability platform. It has a free tier with a monthly trace allowance (check the current pricing page for the exact limit), and it's essential for debugging RAG in production.

# Set environment variables
export LANGCHAIN_TRACING_V2=true
export LANGCHAIN_API_KEY=your-api-key
export LANGCHAIN_PROJECT=ai-tooling-guide-rag

# That's it — once set, all LangChain runs are automatically traced
# You can see:
# - Which documents were retrieved for each query
# - How long each step took
# - The full chain of calls (retrieval → re-ranking → LLM → output)
# - Cost tracking per run

In production, use LangSmith to set up monitors that alert you when retrieval quality drops or latency spikes.

Error Handling for Production

import logging

logger = logging.getLogger(__name__)

def fallback_chain(query: str) -> str:
    """When RAG fails, fall back to a general LLM response."""
    return llm.invoke(
        f"You're a general assistant. Answer: {query}"
    ).content

# Wrap your RAG chain with a try/except
def safe_rag(query: str) -> str:
    try:
        return rag_chain.invoke(query)
    except Exception as e:
        logger.warning("RAG failed: %s. Falling back.", e)
        return fallback_chain(query)
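Many production failures are transient (a rate limit, a vector store timeout), so it is usually worth retrying before falling back. The sketch below adds exponential backoff in front of the fallback. It works on plain callables so it runs on its own; `with_retries` and `safe_answer` are hypothetical names, and the retry count and delays are assumptions to tune.

```python
import logging
import time
from typing import Callable

logger = logging.getLogger(__name__)


def with_retries(
    fn: Callable[[str], str],
    attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> Callable[[str], str]:
    """Wrap fn so failures are retried with exponential backoff."""
    def wrapped(query: str) -> str:
        for attempt in range(attempts):
            try:
                return fn(query)
            except Exception:
                if attempt == attempts - 1:
                    raise  # out of retries: let the caller decide what to do
                sleep(base_delay * 2 ** attempt)  # 0.5s, 1s, 2s, ...
        raise RuntimeError("attempts must be at least 1")
    return wrapped


def safe_answer(
    query: str,
    primary: Callable[[str], str],
    fallback: Callable[[str], str],
    attempts: int = 3,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Try primary with retries, then hand the query to fallback."""
    try:
        return with_retries(primary, attempts=attempts, sleep=sleep)(query)
    except Exception as exc:
        logger.warning("Primary chain failed after %d attempts: %s", attempts, exc)
        return fallback(query)
```

Usage would look like `safe_answer(query, rag_chain.invoke, fallback_chain)`. The `sleep` parameter exists only so tests can skip the real delays.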

Production Checklist

Before deploying your RAG pipeline, verify each item:

- Chunking: Test chunk boundaries on your actual documents. Look for mid-table, mid-code, mid-sentence splits.
- Retrieval: Run at least 50 test queries. Measure recall@5. Aim for >0.85.
- Re-ranking: Confirm your re-ranker actually improves top-3 relevance over naive similarity.
- Latency: Benchmark p50, p95, p99 latency. Target <3s for the full pipeline.
- Cost: Estimate daily token usage. LLM input tokens (the retrieved context sent with every query) are usually the biggest cost factor; query embeddings are cheap by comparison.
- Fallback: Test what happens when the vector store is down, or when no relevant chunks are found.
- Monitoring: Set up LangSmith alerts for error rate >1% and p99 latency >5s.
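The latency and monitoring items are easy to check offline from a list of recorded request timings. The sketch below computes nearest-rank percentiles and compares them with the thresholds from the checklist. `percentile` and `check_health` are hypothetical helpers, not LangSmith features, and the sample numbers are made up for illustration.

```python
import math


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest value with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def check_health(
    latencies_s: list[float],
    errors: int,
    total: int,
    p99_limit_s: float = 5.0,
    error_rate_limit: float = 0.01,
) -> list[str]:
    """Return the names of the alerts that would fire for this window."""
    alerts = []
    if total > 0 and errors / total > error_rate_limit:
        alerts.append("error_rate")
    if latencies_s and percentile(latencies_s, 99) > p99_limit_s:
        alerts.append("p99_latency")
    return alerts


# Made-up sample: ten request latencies in seconds, one slow outlier.
latencies = [0.8, 1.2, 1.1, 0.9, 6.0, 1.0, 1.3, 0.7, 1.4, 1.5]
p50 = percentile(latencies, 50)
p95 = percentile(latencies, 95)
alerts = check_health(latencies, errors=2, total=100)
```

With only ten samples the p95 and p99 are both the single outlier, which is why you want a few hundred requests before trusting tail latencies.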

Where to Go Next

This pipeline is a solid foundation, but there's always more to optimize:

- Query rewriting: Transform user queries before retrieval (e.g., expanding acronyms, correcting typos)
- Multi-hop retrieval: For questions that need information from multiple documents
- Graph RAG: When relationships between entities matter (using LangChain's graph integration)
- Streaming: Return tokens as they're generated for better UX
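Query rewriting does not have to start with an LLM. As a minimal sketch of the acronym-expansion idea, the function below appends the long form next to each known acronym, so keyword search still matches the short form while vector search gets the extra context. The `ACRONYMS` table and the `expand_acronyms` name are hypothetical; in practice the table would come from your own domain glossary.

```python
import re

# Hypothetical glossary; replace with terms from your own domain.
ACRONYMS = {
    "2fa": "two-factor authentication",
    "rma": "return merchandise authorization",
}


def expand_acronyms(query: str, acronyms: dict[str, str] = ACRONYMS) -> str:
    """Append the expansion after each known acronym, keeping the original token."""
    def replace(match: re.Match) -> str:
        word = match.group(0)
        expansion = acronyms.get(word.lower())
        return f"{word} ({expansion})" if expansion else word

    return re.sub(r"\b\w+\b", replace, query)


rewritten = expand_acronyms("How do I enable 2FA?")
```

You would call this on the user's question before passing it to the retriever, while still showing the original wording to the LLM.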

The difference between a demo RAG and a production RAG is in the details — proper chunking, hybrid retrieval, re-ranking, evaluation, and monitoring. Each layer adds complexity but also reliability. Start with the basics and add layers as your use case demands.
