Advanced RAG Techniques in 2026: Hybrid Search, Graph RAG, Reranking, and Evaluation

Go beyond basic RAG with advanced techniques used in production systems. Covers hybrid search, Graph RAG, cross-encoder reranking, query decomposition, and evaluation frameworks.

13 min read

Beyond Basic RAG

Basic RAG (chunk documents → embed → top-k similarity search → LLM generation) works for demos. Production systems in 2026 typically stack five or more advanced techniques on top of it to push retrieval accuracy above 95%.

Here's what a production-grade RAG pipeline looks like:

Query → Query Rewriting → Hybrid Search → Reranking → Graph Augmentation → LLM Generation

Let's break down each layer.

1. Query Rewriting & Decomposition

Raw user queries are terrible for retrieval. A user asks "How do I fix the error?", which names neither the error nor the tool, so its embedding has nothing specific to match in vector search.

Query Rewriting transforms the user query before retrieval:

from openai import OpenAI

def rewrite_query(query: str, chat_history: list) -> str:
    """Rewrite a raw user query into a standalone, search-optimized query."""
    client = OpenAI()

    messages = [
        {"role": "system", "content": """
            Rewrite the user's query into a standalone question optimized for
            document retrieval. Expand abbreviations, resolve references to
            chat history, and use domain-specific terminology.

            Return ONLY the rewritten query, nothing else.
        """},
        *chat_history[-6:],  # Last 6 messages for context
        {"role": "user", "content": query},
    ]

    response = client.chat.completions.create(
        model="gpt-4o-mini",  # Cheap model works fine for this
        messages=messages,
        temperature=0.1,
    )
    return response.choices[0].message.content

# Before: "How do I fix it?"
# After: "How to fix the 'context window full' error in Claude Code"
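One way to make the rewriting step testable without hitting the API is to build the message list in its own function. The sketch below is an assumption layered on the example above: `build_rewrite_messages`, `REWRITE_PROMPT`, and `max_history` are hypothetical names, and the shortened prompt is only a stand-in.

```python
# Hypothetical helper: assembles the rewrite prompt without calling any API.
REWRITE_PROMPT = (
    "Rewrite the user's query into a standalone question optimized for "
    "document retrieval. Return ONLY the rewritten query."
)

def build_rewrite_messages(query: str, chat_history: list[dict], max_history: int = 6) -> list[dict]:
    """Return system prompt + recent history + the raw query, in that order."""
    # Keep only well-formed user/assistant turns
    turns = [
        m for m in chat_history
        if m.get("role") in ("user", "assistant") and m.get("content")
    ]
    # Guard against max_history == 0, because turns[-0:] would keep everything
    recent = turns[-max_history:] if max_history > 0 else []
    return [
        {"role": "system", "content": REWRITE_PROMPT},
        *recent,
        {"role": "user", "content": query},
    ]

history = [
    {"role": "user", "content": "Claude Code says 'context window full'."},
    {"role": "assistant", "content": "That happens when the session gets too long."},
]
messages = build_rewrite_messages("How do I fix it?", history)
print([m["role"] for m in messages])
```

The returned list can be passed straight to `client.chat.completions.create(messages=...)`, and the history trimming can be unit-tested on its own.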

Query Decomposition splits complex multi-part questions:

import json

def decompose_query(query: str) -> list[str]:
    """Break a multi-part query into individual retrieval sub-queries."""
    client = OpenAI()

    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "system",
            # JSON mode returns an object, so ask for one with a "queries" key
            "content": 'Break this complex question into 2-5 simpler sub-questions. '
                       'Return a JSON object of the form {"queries": ["...", "..."]}.'
        }, {
            "role": "user",
            "content": query
        }],
        response_format={"type": "json_object"},
    )

    sub_queries = json.loads(response.choices[0].message.content)["queries"]
    return sub_queries

# Input: "Compare LangChain and LlamaIndex for RAG with Pinecone, including costs and scaling"
# Output: ["What is LangChain's RAG architecture with Pinecone?", "What is LlamaIndex's RAG architecture with Pinecone?", ...]
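Decomposition leaves you with one result list per sub-query, and those lists need to be merged before generation. The article does not prescribe a merge strategy, so the following is only a sketch under that assumption: `merge_sub_query_results` is a hypothetical helper that interleaves the lists round-robin and drops duplicates, and the document IDs are made up.

```python
from itertools import zip_longest

def merge_sub_query_results(results_per_query: list[list[str]], top_k: int = 5) -> list[str]:
    """Interleave per-sub-query result lists round-robin, dropping duplicates."""
    merged: list[str] = []
    seen: set[str] = set()
    # zip_longest yields the rank-1 hit of every list, then the rank-2 hits, and so on
    for same_rank in zip_longest(*results_per_query):
        for doc in same_rank:
            if doc is not None and doc not in seen:
                seen.add(doc)
                merged.append(doc)
    return merged[:top_k]

# Made-up document IDs for two sub-queries
results = [
    ["langchain-rag", "pinecone-pricing", "pinecone-scaling"],
    ["llamaindex-rag", "pinecone-pricing"],
]
print(merge_sub_query_results(results, top_k=4))
# → ['langchain-rag', 'llamaindex-rag', 'pinecone-pricing', 'pinecone-scaling']
```

Round-robin keeps every sub-question represented in the final context, so one sub-query with many hits cannot crowd out the others.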

2. Hybrid Search: Dense + Sparse

Vector search alone misses exact keyword matches. Keyword search alone misses semantic matches. Hybrid search combines both.

from langchain_community.retrievers import BM25Retriever
from langchain_community.vectorstores import Pinecone  # one possible vector_store backend

class HybridRetriever:
    def __init__(self, vector_store, documents):
        self.vector_store = vector_store
        self.bm25_retriever = BM25Retriever.from_documents(documents)
        self.bm25_retriever.k = 10

    def retrieve(self, query: str, alpha: float = 0.5, top_k: int = 5):
        """
        Combines dense (vector) and sparse (BM25) retrieval.
        alpha = 0 → pure BM25, alpha = 1 → pure vector search.
        """
        # Dense retrieval
        dense_results = self.vector_store.similarity_search_with_relevance_scores(
            query, k=10
        )
        dense_scores = {doc.page_content: score for doc, score in dense_results}

        # Sparse retrieval (BM25)
        sparse_results = self.bm25_retriever.invoke(query)
        max_sparse = len(sparse_results)
        # The retriever returns ranked documents without scores,
        # so convert each rank into a 0-1 score
        sparse_scores = {}
        for i, doc in enumerate(sparse_results):
            sparse_scores[doc.page_content] = 1 - (i / max_sparse)

        # Combine scores
        all_docs = set(dense_scores) | set(sparse_scores)
        combined = []
        for doc in all_docs:
            dense_score = dense_scores.get(doc, 0)
            sparse_score = sparse_scores.get(doc, 0)
            combined_score = alpha * dense_score + (1 - alpha) * sparse_score
            combined.append((doc, combined_score))

        # Sort and return top-k
        combined.sort(key=lambda x: x[1], reverse=True)
        return combined[:top_k]

When to adjust alpha:
- Technical code queries → α = 0.3 (BM25 weighted higher, because exact variable names matter)
- Conceptual questions → α = 0.7 (semantic matching matters more)
- Product names/APIs → α = 0.2 (keywords critical)
- Default starting point → α = 0.5
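To see how alpha changes the ranking, here is a small worked example with made-up scores. It is a sketch, not the retriever above: `min_max` and `blend` are hypothetical helpers, and min-max normalization is an assumption introduced so that raw BM25 scores and cosine similarities land on the same 0-1 scale before blending.

```python
def min_max(scores: dict[str, float]) -> dict[str, float]:
    """Scale scores to [0, 1]; if all scores are equal, map them all to 1.0."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}

def blend(dense: dict[str, float], sparse: dict[str, float], alpha: float) -> list[tuple[str, float]]:
    """alpha = 0 → pure sparse, alpha = 1 → pure dense."""
    dense_n, sparse_n = min_max(dense), min_max(sparse)
    combined = [
        (doc, alpha * dense_n.get(doc, 0.0) + (1 - alpha) * sparse_n.get(doc, 0.0))
        for doc in set(dense_n) | set(sparse_n)
    ]
    # Highest score first; document ID breaks ties deterministically
    combined.sort(key=lambda pair: (-pair[1], pair[0]))
    return combined

# Made-up scores: the concept guide wins on semantics, the API reference on keywords
dense = {"concept-guide": 0.82, "api-reference": 0.66, "changelog": 0.60}
sparse = {"api-reference": 12.0, "changelog": 9.0, "concept-guide": 2.0}

print(blend(dense, sparse, alpha=0.2)[0][0])  # → api-reference
print(blend(dense, sparse, alpha=0.7)[0][0])  # → concept-guide
```

The same two result lists produce a different winner depending on alpha, which is why it is worth tuning per query type rather than leaving it at 0.5.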

3. Cross-Encoder Reranking

The biggest single improvement you can make to RAG quality is reranking. A cross-encoder reads query + document together (not independently like bi-encoders), giving much more accurate relevance scores.

from sentence_transformers import CrossEncoder
import torch

class Reranker:
    def __init__(self, model_name: str = "BAAI/bge-reranker-v2-m3"):
        # Common reranker options:
        # - MS-MARCO cross-encoders (web/docs)
        # - BGE-Reranker (balanced)
        # - Cohere Rerank (API-based)
        self.model = CrossEncoder(model_name)

    def rerank(self, query: str, documents: list, top_k: int = 3, apply_threshold: float = 0.0):
        """Rerank documents using cross-encoder."""
        # Score each (query, document) pair jointly
        pairs = [[query, doc] for doc in documents]

        with torch.no_grad():
            scores = self.model.predict(pairs)

        # Sort by score descending
        scored_docs = list(zip(documents, scores))
        scored_docs.sort(key=lambda x: x[1], reverse=True)

        # Apply threshold filter
        filtered = [(doc, score) for doc, score in scored_docs if score >= apply_threshold]

        return filtered[:top_k]

Reranking impact (benchmark):

| Retrieval Method | Top-5 Accuracy | Top-3 Accuracy | Top-1 Accuracy |
|-----------------|---------------|---------------|---------------|
| Dense only | 78% | 65% | 42% |
| Hybrid (dense + BM25) | 86% | 74% | 51% |
| Hybrid + Reranking | 94% | 88% | 73% |

Reranking consistently improves top-1 accuracy by 20-25 percentage points.
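The article does not say how the top-k accuracy figures were computed. A common definition, assumed here, is the hit rate: the fraction of queries whose known-relevant document appears among the first k results. The sketch below uses that definition with one relevant document per query; `top_k_accuracy` and the document IDs are hypothetical.

```python
def top_k_accuracy(ranked_results: list[list[str]], relevant: list[str], k: int) -> float:
    """Fraction of queries whose relevant document appears in the first k results."""
    if len(ranked_results) != len(relevant):
        raise ValueError("need exactly one relevant document ID per query")
    if not ranked_results:
        return 0.0
    hits = sum(
        1 for ranked, gold in zip(ranked_results, relevant) if gold in ranked[:k]
    )
    return hits / len(ranked_results)

# Four queries with made-up rankings and one labeled relevant document each
ranked = [
    ["d1", "d7", "d3"],
    ["d4", "d2", "d9"],
    ["d8", "d5", "d6"],
    ["d3", "d1", "d2"],
]
gold = ["d1", "d2", "d6", "d9"]

print(top_k_accuracy(ranked, gold, k=1))  # → 0.25
print(top_k_accuracy(ranked, gold, k=3))  # → 0.75
```

Running this before and after adding a reranker on your own labeled queries gives you a table like the one above for your data.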

4. Graph RAG: Understanding Relationships

Standard RAG treats documents as independent chunks. Graph RAG captures relationships between entities, enabling multi-hop reasoning.

# Illustrative interface: package and parameter names vary between Graph RAG
# libraries, so check the API of the one you install
from langchain_graph_rag import GraphRAG

graph_rag = GraphRAG(
    # Extract entities and relationships from documents
    entity_extractor="gpt-4o-mini",
    graph_store="neo4j",  # or "memgraph", "kuzu"

    # Indexing configuration
    relationship_extraction={
        "types": ["DEPENDS_ON", "USES", "ALTERNATIVE_TO", "PART_OF"],
        "max_depth": 3,
    },

    # Retrieval
    hybrid_search=True,
    reranker=True,
    community_detection=True,  # Group related entities
)

# Index documents
graph_rag.add_documents(documents)

# Query with multi-hop reasoning
result = graph_rag.query("Which vector databases support hybrid search and have Python SDKs?")
# Without Graph RAG: "Does Pinecone support hybrid search?"
# With Graph RAG: "Pinecone → supports → hybrid search, Pinecone → has → Python SDK"
#                 "Weaviate → supports → hybrid search, Weaviate → has → Python SDK"

Graph RAG excels at questions involving:
- Comparisons ("What's better, X or Y for my use case?")
- Causal chains ("Why does Z happen when I use X?")
- Multi-step ("Find tools that do A and B but not C")
- Trade-offs ("What do I give up by choosing X over Y?")
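To make the "A and B but not C" case concrete, here is a toy in-memory version of the idea. It is a sketch, not how a graph database works internally: the triples reuse the Pinecone and Weaviate facts from the example above, "ExampleDB" is a made-up entry added so the filter has something to exclude, and `build_index` and `entities_matching` are hypothetical helpers.

```python
from collections import defaultdict

# (subject, relation, object) triples; "ExampleDB" is invented for illustration
TRIPLES = [
    ("Pinecone", "supports", "hybrid search"),
    ("Pinecone", "has", "Python SDK"),
    ("Weaviate", "supports", "hybrid search"),
    ("Weaviate", "has", "Python SDK"),
    ("ExampleDB", "has", "Python SDK"),
]

def build_index(triples):
    """Map each (relation, object) pair to the set of subjects that have it."""
    index = defaultdict(set)
    for subject, relation, obj in triples:
        index[(relation, obj)].add(subject)
    return index

def entities_matching(index, require, exclude=()):
    """Entities that satisfy every condition in `require` and none in `exclude`."""
    if not require:
        return []
    # set.intersection returns a new set, so the index itself is never mutated
    found = set.intersection(*(index.get(cond, set()) for cond in require))
    for cond in exclude:
        found -= index.get(cond, set())
    return sorted(found)

index = build_index(TRIPLES)
print(entities_matching(index, [("supports", "hybrid search"), ("has", "Python SDK")]))
# → ['Pinecone', 'Weaviate']
print(entities_matching(index, [("has", "Python SDK")], exclude=[("supports", "hybrid search")]))
# → ['ExampleDB']
```

Chunk-based retrieval would have to find a single passage that states all of these facts together. With relationships stored explicitly, the answer is a set operation over facts that may come from different documents.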

5. RAPTOR: Recursive Abstractive Processing

RAPTOR creates a hierarchical summary tree. Documents are chunked, chunks are summarized, summaries are summarized again — creating layers of abstraction.

Level 0: Raw chunks (1000s of small pieces)
    ↓
Level 1: Topic clusters (100s of medium pieces)
    ↓
Level 2: Section summaries (10s of summaries)
    ↓
Level 3: Document summary (1 executive summary)

During retrieval, RAPTOR searches all levels and picks the most relevant. A question like "What's the main finding of this 200-page report?" matches Level 3 — it doesn't need to search a million chunks.

# Illustrative interface: class and parameter names depend on the RAPTOR
# implementation you install
from raptor import RAPTOR

raptor = RAPTOR(
    embed_model="text-embedding-3-large",
    llm_model="gpt-4o-mini",  # For summarization
    clustering="gmm",  # Gaussian Mixture for better topic clustering
    levels=4,
)

tree = raptor.index_documents(documents)
results = raptor.retrieve("What are the key conclusions?", levels=[2, 3])
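The tree idea is easier to see in a stripped-down form. The sketch below is a deliberate simplification and every name in it is hypothetical: fixed-size grouping stands in for embedding-based clustering, plain string concatenation stands in for LLM summarization, and word overlap stands in for embedding similarity. It only shows how searching all levels at once lets a narrow question land on a raw chunk and a broad one land on a summary.

```python
from typing import Callable

def build_tree(chunks: list[str], summarize: Callable[[list[str]], str],
               group_size: int = 2) -> list[list[str]]:
    """Return levels where levels[0] holds raw chunks and levels[-1] one root summary."""
    if group_size < 2:
        raise ValueError("group_size must be at least 2")
    levels = [list(chunks)]
    while len(levels[-1]) > 1:
        current = levels[-1]
        groups = [current[i:i + group_size] for i in range(0, len(current), group_size)]
        levels.append([summarize(group) for group in groups])
    return levels

def retrieve(levels: list[list[str]], query: str, top_k: int = 1) -> list[tuple[int, str]]:
    """Score every node on every level by word overlap with the query."""
    query_words = set(query.lower().split())
    scored = []
    for level, nodes in enumerate(levels):
        for text in nodes:
            overlap = len(query_words & set(text.lower().split()))
            scored.append((overlap, level, text))
    # Highest overlap first; on ties prefer the most specific (lowest) level
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [(level, text) for _, level, text in scored[:top_k]]

chunks = [
    "latency dropped after caching",
    "cache hit rate reached ninety percent",
    "costs rose with larger models",
    "budget alerts were added",
]
# Stand-in "summarizer": joins the group instead of calling an LLM
levels = build_tree(chunks, summarize=" ".join)

print(retrieve(levels, "cache hit rate")[0][0])     # → 0 (a raw chunk)
print(retrieve(levels, "latency and costs")[0][0])  # → 2 (the root summary)
```

In a real setup the summaries are much shorter than their children, so a high-level match also spends fewer context tokens.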

6. Evaluation: Don't Deploy Without It

The #1 RAG mistake? Deploying without evaluation. Production RAG needs continuous monitoring:

from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
)

# Create test set
test_set = [
    {
        "question": "How do I set up Claude Code with DeepSeek?",
        "answer": "...",        # Generated by your RAG system
        "contexts": ["..."],    # Retrieved documents
        "ground_truth": "...",  # Ideal answer
    }
]

# evaluate() expects a Hugging Face Dataset, not a plain list of dicts.
# These column names follow the classic Ragas schema; newer releases may differ.
dataset = Dataset.from_list(test_set)

# Run evaluation
results = evaluate(
    dataset=dataset,
    metrics=[
        faithfulness,       # Does the answer stay true to the context?
        answer_relevancy,   # Is the answer relevant to the question?
        context_precision,  # Are the retrieved documents all relevant?
        context_recall,     # Are all relevant documents retrieved?
    ],
)

print(f"""
Faithfulness: {results['faithfulness']:.2%}
Answer Relevancy: {results['answer_relevancy']:.2%}
Context Precision: {results['context_precision']:.2%}
Context Recall: {results['context_recall']:.2%}
""")
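Ragas relies on an LLM to judge relevance. If you already have labeled relevant document IDs for each test question, a cheaper set-based proxy for the two retrieval metrics can run on every commit. This sketch is not the Ragas implementation, and `context_precision_recall` and the document IDs are hypothetical.

```python
def context_precision_recall(retrieved: list[str], relevant: set[str]) -> tuple[float, float]:
    """Set-based precision and recall over document IDs."""
    retrieved_set = set(retrieved)
    hits = len(retrieved_set & relevant)
    # Precision: share of retrieved documents that are relevant
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    # Recall: share of relevant documents that were retrieved
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

precision, recall = context_precision_recall(
    retrieved=["d1", "d2", "d3", "d4"],
    relevant={"d1", "d3", "d5"},
)
print(f"precision={precision:.2f} recall={recall:.2f}")
# → precision=0.50 recall=0.67
```

Unlike this proxy, a rank-aware metric also rewards putting the relevant documents first, so treat these numbers as a quick regression check rather than a replacement for the full evaluation.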

Production targets:

| Metric | Good | Great | Excellent |
|--------|------|-------|-----------|
| Faithfulness | >85% | >92% | >97% |
| Answer Relevancy | >80% | >90% | >95% |
| Context Precision | >75% | >85% | >92% |
| Context Recall | >70% | >82% | >90% |
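For continuous monitoring, the targets can be encoded as a simple gate in CI or a nightly job. The thresholds below are copied from the table; everything else (`TARGETS`, `grade`, `failing_metrics`, and the sample scores) is a hypothetical sketch.

```python
# Thresholds for (Good, Great, Excellent), taken from the targets table
TARGETS = {
    "faithfulness": (0.85, 0.92, 0.97),
    "answer_relevancy": (0.80, 0.90, 0.95),
    "context_precision": (0.75, 0.85, 0.92),
    "context_recall": (0.70, 0.82, 0.90),
}
TIERS = ("Good", "Great", "Excellent")

def grade(metric: str, score: float) -> str:
    """Return the highest tier whose threshold the score strictly exceeds."""
    tier = "Below target"
    for name, threshold in zip(TIERS, TARGETS[metric]):
        if score > threshold:
            tier = name
    return tier

def failing_metrics(scores: dict[str, float]) -> list[str]:
    """Metrics that do not even reach the 'Good' tier."""
    return sorted(m for m, s in scores.items() if grade(m, s) == "Below target")

# Made-up scores from an evaluation run
scores = {
    "faithfulness": 0.93,
    "answer_relevancy": 0.78,
    "context_precision": 0.80,
    "context_recall": 0.91,
}
for metric, score in scores.items():
    print(f"{metric}: {grade(metric, score)}")
print(failing_metrics(scores))  # → ['answer_relevancy']
```

Failing the build when `failing_metrics` is non-empty catches retrieval regressions before they reach users.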

7. Putting It All Together: Production Pipeline

Here's what a complete production RAG pipeline looks like in 2026:

from langchain_openai import ChatOpenAI

class ProductionRAGPipeline:
    def __init__(self, vector_store, documents):
        self.hybrid_retriever = HybridRetriever(vector_store, documents)
        self.reranker = Reranker()
        self.graph_rag = GraphRAG(...)  # Configure as in section 4
        self.llm = ChatOpenAI(model="gpt-4o")

    def query(self, user_query: str, chat_history: list | None = None):
        # 1. Rewrite query for optimal retrieval (rewrite_query from section 1)
        rewritten = rewrite_query(user_query, chat_history or [])

        # 2. Hybrid search (dense + sparse)
        initial_results = self.hybrid_retriever.retrieve(rewritten, alpha=0.5, top_k=10)

        # 3. Graph-aware expansion
        graph_context = self.graph_rag.query(rewritten)

        # 4. Rerank with cross-encoder
        doc_texts = [doc for doc, _ in initial_results]
        reranked = self.reranker.rerank(rewritten, doc_texts, top_k=4)

        # 5. Generate response
        context = self._build_context(reranked, graph_context)
        response = self.llm.invoke(f"Context:\n{context}\n\nQuestion: {user_query}")
        return response

    def _build_context(self, reranked, graph_context) -> str:
        # Join the reranked passages, then append the graph-derived context
        passages = "\n\n".join(doc for doc, _ in reranked)
        return f"{passages}\n\nRelated entities:\n{graph_context}"

Upgrade Path

Where to invest based on your current RAG quality:

| If your accuracy is... | Invest in... | Expected gain |
|----------------------|--------------|---------------|
| <60% | Better chunking + reranking | +25% |
| 60-75% | Hybrid search + query rewriting | +15% |
| 75-85% | Cross-encoder reranking | +10% |
| 85-92% | Graph RAG + entity extraction | +7% |
| 92-97% | RAPTOR hierarchical retrieval | +3% |
| >97% | You're done. Focus on latency and cost | n/a |

Quick Start

pip install langchain langchain-community langchain-openai openai rank_bm25 pinecone-client sentence-transformers ragas datasets

The biggest bang for your buck in 2026: hybrid search + cross-encoder reranking. Those two techniques alone will get most RAG systems above 90% retrieval accuracy. Graph RAG and RAPTOR are for the last few percentage points — only invest if your baseline is already solid.
