Advanced RAG Techniques in 2026: Hybrid Search, Graph RAG, Reranking, and Evaluation
Go beyond basic RAG with advanced techniques used in production systems. Covers hybrid search, Graph RAG, cross-encoder reranking, query decomposition, and evaluation frameworks.
Beyond Basic RAG
Basic RAG (chunk documents → embed → top-k similarity search → LLM generation) works for demos. Production systems in 2026 typically stack five or more advanced techniques on top of it, with the goal of pushing retrieval accuracy above 95%.
Here's what a production-grade RAG pipeline looks like:
Query → Query Rewriting → Hybrid Search → Reranking → Graph Augmentation → LLM Generation
Let's break down each layer.
1. Query Rewriting & Decomposition
Raw user queries are terrible for retrieval. Users ask "How do I fix the error?" — that's useless for vector search.
Query Rewriting transforms the user query before retrieval:
    from openai import OpenAI


    def rewrite_query(query: str, chat_history: list) -> str:
        """Rewrite a raw user query into a standalone, search-optimized query."""
        client = OpenAI()
        messages = [
            {"role": "system", "content": """
                Rewrite the user's query into a standalone question optimized for
                document retrieval. Expand abbreviations, resolve references to
                chat history, and use domain-specific terminology.
                Return ONLY the rewritten query, nothing else.
            """},
            *chat_history[-6:],  # Last 6 messages for context
            {"role": "user", "content": query},
        ]
        response = client.chat.completions.create(
            model="gpt-4o-mini",  # Cheap model works fine for this
            messages=messages,
            temperature=0.1,
        )
        return response.choices[0].message.content


    # Before: "How do I fix it?"
    # After: "How to fix the 'context window full' error in Claude Code"
Query Decomposition splits complex multi-part questions:
    import json

    from openai import OpenAI


    def decompose_query(query: str) -> list[str]:
        """Break a multi-part query into individual retrieval sub-queries."""
        client = OpenAI()
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{
                "role": "system",
                # JSON mode returns an object, so ask for a named key, not a bare array.
                "content": (
                    "Break this complex question into 2-5 simpler sub-questions. "
                    'Return a JSON object of the form {"queries": ["...", "..."]}.'
                ),
            }, {
                "role": "user",
                "content": query,
            }],
            response_format={"type": "json_object"},
        )
        sub_queries = json.loads(response.choices[0].message.content)["queries"]
        return sub_queries


    # Input: "Compare LangChain and LlamaIndex for RAG with Pinecone, including costs and scaling"
    # Output: ["What is LangChain's RAG architecture with Pinecone?", "What is LlamaIndex's RAG architecture with Pinecone?", ...]
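Once a question has been decomposed, each sub-query is usually retrieved on its own and the results merged. Below is a minimal sketch of that merge step. It assumes a retrieval callable that returns `(text, score)` pairs; the names `retrieve_for_subqueries`, `retrieve_fn`, and `fake_retrieve`, along with the toy data, are hypothetical and not part of any library.

```python
from typing import Callable


def retrieve_for_subqueries(
    sub_queries: list[str],
    retrieve_fn: Callable[[str], list[tuple[str, float]]],
    top_k: int = 5,
) -> list[tuple[str, float]]:
    """Run retrieval once per sub-query and merge results, de-duplicating by text."""
    best: dict[str, float] = {}
    for sub_query in sub_queries:
        for text, score in retrieve_fn(sub_query):
            # Keep the highest score seen for each distinct passage.
            if score > best.get(text, float("-inf")):
                best[text] = score
    merged = sorted(best.items(), key=lambda item: item[1], reverse=True)
    return merged[:top_k]


# Toy retriever standing in for a real search backend (hypothetical data).
def fake_retrieve(query: str) -> list[tuple[str, float]]:
    corpus = {
        "langchain": [("LangChain Pinecone guide", 0.9), ("Pricing overview", 0.4)],
        "llamaindex": [("LlamaIndex Pinecone guide", 0.8), ("Pricing overview", 0.6)],
    }
    key = "langchain" if "LangChain" in query else "llamaindex"
    return corpus[key]


merged_results = retrieve_for_subqueries(
    ["What is LangChain's RAG architecture?", "What is LlamaIndex's RAG architecture?"],
    fake_retrieve,
)
```

Keeping the maximum score per passage is only one possible merge policy; summing scores would instead favor passages that several sub-queries agree on.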
2. Hybrid Search: Dense + Sparse
Vector search alone misses exact keyword matches. Keyword search alone misses semantic matches. Hybrid search combines both.
    from langchain_community.retrievers import BM25Retriever
    from langchain_community.vectorstores import Pinecone


    class HybridRetriever:
        def __init__(self, vector_store, documents):
            self.vector_store = vector_store
            self.bm25_retriever = BM25Retriever.from_documents(documents)
            self.bm25_retriever.k = 10

        def retrieve(self, query: str, alpha: float = 0.5, top_k: int = 5):
            """
            Combines dense (vector) and sparse (BM25) retrieval.
            alpha = 0 → pure BM25, alpha = 1 → pure vector search.
            """
            # Dense retrieval
            dense_results = self.vector_store.similarity_search_with_relevance_scores(
                query, k=10
            )
            dense_scores = {doc.page_content: score for doc, score in dense_results}

            # Sparse retrieval (BM25)
            sparse_results = self.bm25_retriever.invoke(query)
            # The retriever returns documents in rank order, so derive a 0-1 score from rank
            max_sparse = len(sparse_results)
            sparse_scores = {}
            for i, doc in enumerate(sparse_results):
                sparse_scores[doc.page_content] = 1 - (i / max_sparse)

            # Combine scores
            all_docs = set(list(dense_scores.keys()) + list(sparse_scores.keys()))
            combined = []
            for doc in all_docs:
                dense_score = dense_scores.get(doc, 0)
                sparse_score = sparse_scores.get(doc, 0)
                combined_score = alpha * dense_score + (1 - alpha) * sparse_score
                combined.append((doc, combined_score))

            # Sort and return top-k as (text, score) pairs
            combined.sort(key=lambda x: x[1], reverse=True)
            return combined[:top_k]
When to adjust alpha:
- Technical code queries → α = 0.3 (BM25 weighted higher, because exact variable names matter)
- Conceptual questions → α = 0.7 (semantic matching matters more)
- Product names/APIs → α = 0.2 (keywords critical)
- Default starting point → α = 0.5
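To see how α shifts the ranking, here is a small worked example of the weighted blend on its own, with no retriever involved. The function name `blend_scores`, the document IDs, and the scores are all made up for illustration, and both score sets are assumed to be normalized to the 0-1 range already.

```python
def blend_scores(
    dense: dict[str, float],
    sparse: dict[str, float],
    alpha: float,
) -> list[tuple[str, float]]:
    """Weighted fusion: alpha * dense + (1 - alpha) * sparse; missing scores count as 0."""
    doc_ids = set(dense) | set(sparse)
    fused = {
        doc_id: alpha * dense.get(doc_id, 0.0) + (1 - alpha) * sparse.get(doc_id, 0.0)
        for doc_id in doc_ids
    }
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)


# Hypothetical scores: the concept guide matches semantically,
# the API reference matches on exact keywords.
dense_scores = {"concept_guide": 0.9, "api_reference": 0.4}
sparse_scores = {"api_reference": 1.0, "concept_guide": 0.2}

keyword_heavy = blend_scores(dense_scores, sparse_scores, alpha=0.2)
semantic_heavy = blend_scores(dense_scores, sparse_scores, alpha=0.7)
```

With α = 0.2 the API reference ranks first (0.88 vs 0.34); with α = 0.7 the concept guide overtakes it (0.69 vs 0.58).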
3. Cross-Encoder Reranking
The biggest single improvement you can make to RAG quality is reranking. A cross-encoder reads query + document together (not independently like bi-encoders), giving much more accurate relevance scores.
    from sentence_transformers import CrossEncoder
    import torch


    class Reranker:
        def __init__(self, model_name: str = "BAAI/bge-reranker-v2-m3"):
            self.model = CrossEncoder(model_name)
            # Common model families:
            # - MS MARCO cross-encoders (web/docs)
            # - BGE rerankers (balanced; the default here)
            # - Cohere Rerank (API-based, so not loaded through CrossEncoder)

        def rerank(self, query: str, documents: list, top_k: int = 3, apply_threshold: float = 0.0):
            """Rerank documents using cross-encoder."""
            # Each pair is scored jointly, so the model sees query and document together
            pairs = [[query, doc] for doc in documents]
            with torch.no_grad():
                scores = self.model.predict(pairs)

            # Sort by score descending
            scored_docs = list(zip(documents, scores))
            scored_docs.sort(key=lambda x: x[1], reverse=True)

            # Apply threshold filter, then keep the top_k survivors
            filtered = [(doc, score) for doc, score in scored_docs if score >= apply_threshold]
            return filtered[:top_k]
Reranking impact (benchmark):
| Retrieval Method | Top-5 Accuracy | Top-3 Accuracy | Top-1 Accuracy |
|-----------------|---------------|---------------|---------------|
| Dense only | 78% | 65% | 42% |
| Hybrid (dense + BM25) | 86% | 74% | 51% |
| Hybrid + Reranking | 94% | 88% | 73% |
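In this benchmark, adding reranking to hybrid search lifts top-1 accuracy from 51% to 73%, a gain of 22 percentage points. Improvements of 20-25 points in top-1 accuracy are typical.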
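Top-k accuracy is commonly computed as the share of queries whose first k results contain at least one relevant document. The sketch below follows that definition, which is an assumption: the article does not state exactly how the benchmark numbers were measured. The function name `top_k_accuracy` and the sample rankings are hypothetical.

```python
def top_k_accuracy(
    ranked_results: dict[str, list[str]],
    relevant: dict[str, set[str]],
    k: int,
) -> float:
    """Fraction of queries with at least one relevant doc ID in the top k results."""
    if not ranked_results:
        return 0.0
    hits = sum(
        1
        for query, ranking in ranked_results.items()
        if relevant.get(query, set()) & set(ranking[:k])
    )
    return hits / len(ranked_results)


# Hypothetical rankings (best first) and gold labels for four queries.
rankings = {
    "q1": ["d3", "d1", "d7"],
    "q2": ["d2", "d9", "d4"],
    "q3": ["d8", "d6", "d5"],
    "q4": ["d5", "d1", "d2"],
}
gold = {"q1": {"d1"}, "q2": {"d2"}, "q3": {"d5"}, "q4": {"d0"}}

top1 = top_k_accuracy(rankings, gold, k=1)
top3 = top_k_accuracy(rankings, gold, k=3)
```

Here only `q2` has its relevant document in first place, so top-1 accuracy is 0.25, while three of the four queries succeed within the top 3, giving 0.75. Reranking aims to close exactly this gap between top-3 and top-1.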
4. Graph RAG: Understanding Relationships
Standard RAG treats documents as independent chunks. Graph RAG captures relationships between entities, enabling multi-hop reasoning.
    # NOTE: illustrative configuration; class and parameter names vary by Graph RAG library.
    from langchain_graph_rag import GraphRAG

    graph_rag = GraphRAG(
        # Extract entities and relationships from documents
        entity_extractor="gpt-4o-mini",
        graph_store="neo4j",  # or "memgraph", "kuzu"
        # Indexing configuration
        relationship_extraction={
            "types": ["DEPENDS_ON", "USES", "ALTERNATIVE_TO", "PART_OF"],
            "max_depth": 3,
        },
        # Retrieval
        hybrid_search=True,
        reranker=True,
        community_detection=True,  # Group related entities
    )

    # Index documents
    graph_rag.add_documents(documents)

    # Query with multi-hop reasoning
    result = graph_rag.query("Which vector databases support hybrid search and have Python SDKs?")

    # Without Graph RAG: "Does Pinecone support hybrid search?"
    # With Graph RAG: "Pinecone → supports → hybrid search, Pinecone → has → Python SDK"
    #                 "Weaviate → supports → hybrid search, Weaviate → has → Python SDK"
Graph RAG excels at questions involving:
- Comparisons ("What's better, X or Y for my use case?")
- Causal chains ("Why does Z happen when I use X?")
- Multi-step ("Find tools that do A and B but not C")
- Trade-offs ("What do I give up by choosing X over Y?")
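The relationship lookups behind such questions can be sketched with plain (subject, relation, object) triples. This is a toy in-memory illustration, not how Neo4j or any Graph RAG library stores or queries its graph. The helper names `build_index` and `entities_matching` are hypothetical, and `ExampleDB` is a made-up entry added for contrast.

```python
from collections import defaultdict

# (subject, relation, object) triples. The Pinecone and Weaviate rows mirror the
# example in this section; "ExampleDB" is invented to show a non-match.
triples = [
    ("Pinecone", "supports", "hybrid search"),
    ("Pinecone", "has", "Python SDK"),
    ("Weaviate", "supports", "hybrid search"),
    ("Weaviate", "has", "Python SDK"),
    ("ExampleDB", "supports", "hybrid search"),
]


def build_index(edges):
    """Map (relation, object) -> set of subjects for fast lookups."""
    index = defaultdict(set)
    for subject, relation, obj in edges:
        index[(relation, obj)].add(subject)
    return index


def entities_matching(index, conditions):
    """Return subjects that satisfy every (relation, object) condition."""
    candidate_sets = [index.get(condition, set()) for condition in conditions]
    if not candidate_sets:
        return set()
    return set.intersection(*candidate_sets)


index = build_index(triples)
matches = entities_matching(
    index,
    [("supports", "hybrid search"), ("has", "Python SDK")],
)
```

`matches` contains Pinecone and Weaviate but not ExampleDB, which lacks the SDK edge. Chunk-level similarity search has no direct way to express that kind of conjunction across separate facts.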
5. RAPTOR: Recursive Abstractive Processing
RAPTOR creates a hierarchical summary tree. Documents are chunked, chunks are summarized, summaries are summarized again — creating layers of abstraction.
Level 0: Raw chunks (1000s of small pieces)
↓
Level 1: Topic clusters (100s of medium pieces)
↓
Level 2: Section summaries (10s of summaries)
↓
Level 3: Document summary (1 executive summary)
During retrieval, RAPTOR searches all levels and picks the most relevant. A question like "What's the main finding of this 200-page report?" matches Level 3 — it doesn't need to search a million chunks.
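The idea of searching all levels at once can be sketched with a flat list of nodes tagged by level. This is a toy under loud assumptions: the summaries are hand-written rather than LLM-generated, word overlap stands in for embedding similarity, and `TreeNode`, `word_overlap`, and `search_all_levels` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class TreeNode:
    level: int  # 0 = raw chunk, higher = more abstract summary
    text: str


def word_overlap(query: str, text: str) -> float:
    """Jaccard similarity on lowercase word sets; a stand-in for embedding similarity."""
    q, t = set(query.lower().split()), set(text.lower().split())
    return len(q & t) / len(q | t) if q | t else 0.0


def search_all_levels(query: str, nodes: list[TreeNode], top_k: int = 1) -> list[TreeNode]:
    """Score every node regardless of level and return the best matches."""
    ranked = sorted(nodes, key=lambda node: word_overlap(query, node.text), reverse=True)
    return ranked[:top_k]


# Hand-written nodes standing in for a real summary tree.
tree = [
    TreeNode(0, "the retry timeout defaults to 30 seconds"),
    TreeNode(0, "latency rose 12 percent after the cache change"),
    TreeNode(1, "configuration defaults for retries and timeouts"),
    TreeNode(3, "main finding of the report is that caching changes drive latency"),
]

broad = search_all_levels("main finding of the report", tree)[0]
specific = search_all_levels("retry timeout defaults", tree)[0]
```

The broad question lands on the level 3 summary, while the specific one lands on a level 0 chunk, without the caller having to choose a level up front.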
    # NOTE: illustrative interface; class and parameter names vary by RAPTOR implementation.
    from raptor import RAPTOR

    raptor = RAPTOR(
        embed_model="text-embedding-3-large",
        llm_model="gpt-4o-mini",  # For summarization
        clustering="gmm",  # Gaussian Mixture for better topic clustering
        levels=4,
    )
    tree = raptor.index_documents(documents)
    results = raptor.retrieve("What are the key conclusions?", levels=[2, 3])
6. Evaluation: Don't Deploy Without It
The #1 RAG mistake? Deploying without evaluation. Production RAG needs continuous monitoring:
    from datasets import Dataset
    from ragas import evaluate
    from ragas.metrics import (
        faithfulness,
        answer_relevancy,
        context_precision,
        context_recall,
    )

    # Create test set
    test_set = [
        {
            "question": "How do I set up Claude Code with DeepSeek?",
            "answer": "...",  # Generated by your RAG system
            "contexts": ["..."],  # Retrieved documents
            "ground_truth": "...",  # Ideal answer
        }
    ]

    # Run evaluation (evaluate expects a Dataset, not a plain list of dicts)
    results = evaluate(
        dataset=Dataset.from_list(test_set),
        metrics=[
            faithfulness,  # Does the answer stay true to the context?
            answer_relevancy,  # Is the answer relevant to the question?
            context_precision,  # Are the retrieved documents all relevant?
            context_recall,  # Are all relevant documents retrieved?
        ],
    )

    print(f"""
    Faithfulness: {results['faithfulness']:.2%}
    Answer Relevancy: {results['answer_relevancy']:.2%}
    Context Precision: {results['context_precision']:.2%}
    Context Recall: {results['context_recall']:.2%}
    """)
Production targets:
| Metric | Good | Great | Excellent |
|--------|------|-------|-----------|
| Faithfulness | >85% | >92% | >97% |
| Answer Relevancy | >80% | >90% | >95% |
| Context Precision | >75% | >85% | >92% |
| Context Recall | >70% | >82% | >90% |
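For intuition about the two context metrics, here is a simplified set-based version over chunk IDs. It is not how Ragas computes them: Ragas relies on LLM judgments rather than labeled IDs. The function name `context_precision_recall` and the chunk IDs are hypothetical.

```python
def context_precision_recall(
    retrieved: list[str],
    relevant: set[str],
) -> tuple[float, float]:
    """Set-based precision and recall over chunk IDs."""
    retrieved_set = set(retrieved)
    true_positives = len(retrieved_set & relevant)
    # Precision: how much of what was retrieved is relevant.
    precision = true_positives / len(retrieved_set) if retrieved_set else 0.0
    # Recall: how much of what is relevant was retrieved.
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


# Four chunks retrieved, three chunks labeled relevant, two in common.
precision, recall = context_precision_recall(
    retrieved=["c1", "c2", "c3", "c4"],
    relevant={"c1", "c3", "c9"},
)
```

Two of the four retrieved chunks are relevant (precision 0.5), and two of the three relevant chunks were found (recall about 0.67). Retrieving more chunks tends to raise recall at the expense of precision, which is why the targets track both.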
7. Putting It All Together: Production Pipeline
Here's what a complete production RAG pipeline looks like in 2026:
    from langchain_openai import ChatOpenAI


    class ProductionRAGPipeline:
        """Wires together rewrite_query, HybridRetriever, and Reranker from the sections above."""

        def __init__(self, vector_store, documents, graph_rag):
            self.hybrid_retriever = HybridRetriever(vector_store, documents)
            self.reranker = Reranker()
            self.graph_rag = graph_rag  # Pre-built Graph RAG index (section 4)
            self.llm = ChatOpenAI(model="gpt-4o")

        def query(self, user_query: str, chat_history: list = None):
            # 1. Rewrite query for optimal retrieval
            rewritten = rewrite_query(user_query, chat_history or [])
            # 2. Hybrid search (dense + sparse)
            initial_results = self.hybrid_retriever.retrieve(rewritten, alpha=0.5, top_k=10)
            # 3. Graph-aware expansion
            graph_context = self.graph_rag.query(rewritten)
            # 4. Rerank with cross-encoder
            doc_texts = [doc for doc, _ in initial_results]
            reranked = self.reranker.rerank(rewritten, doc_texts, top_k=4)
            # 5. Generate response
            context = self._build_context(reranked, graph_context)
            response = self.llm.invoke(f"Context:\n{context}\n\nQuestion: {user_query}")
            return response

        @staticmethod
        def _build_context(reranked, graph_context) -> str:
            """Join reranked passages and graph facts into one prompt context."""
            passages = "\n\n".join(doc for doc, _score in reranked)
            return f"{passages}\n\nGraph context:\n{graph_context}"
Upgrade Path
Where to invest based on your current RAG quality:
| If your accuracy is... | Invest in... | Expected gain |
|----------------------|--------------|---------------|
| <60% | Better chunking + reranking | +25% |
| 60-75% | Hybrid search + query rewriting | +15% |
| 75-85% | Cross-encoder reranking | +10% |
| 85-92% | Graph RAG + entity extraction | +7% |
| 92-97% | RAPTOR hierarchical retrieval | +3% |
| >97% | You're done. Focus on latency and cost | - |
Quick Start
    pip install langchain langchain-community langchain-openai openai rank_bm25 pinecone-client sentence-transformers ragas
The biggest bang for your buck in 2026: hybrid search + cross-encoder reranking. Those two techniques alone will get most RAG systems above 90% retrieval accuracy. Graph RAG and RAPTOR are for the last few percentage points — only invest if your baseline is already solid.