Advanced RAG Techniques in 2026: Hybrid Search, Graph RAG, Reranking, and Evaluation

Beyond Basic RAG

Basic RAG (chunk documents → embed → top-k similarity search → LLM generation) works for demos. Production systems in 2026 typically stack five or more advanced techniques on top of it to push retrieval accuracy above 95%.

Here's what a production-grade RAG pipeline looks like:

Query → Query Rewriting → Hybrid Search → Reranking → Graph Augmentation → LLM Generation

Let's break down each layer.

1. Query Rewriting & Decomposition

Raw user queries are terrible for retrieval. Users ask "How do I fix the error?", which is useless for vector search because it names neither the error nor the tool involved.

Query Rewriting transforms the user query before retrieval:

from openai import OpenAI


def rewrite_query(query: str, chat_history: list) -> str:
    """Rewrite a raw user query into a standalone, search-optimized query."""
    client = OpenAI()
    messages = [
        {"role": "system", "content": """
        Rewrite the user's query into a standalone question optimized for
        document retrieval. Expand abbreviations, resolve references to
        chat history, and use domain-specific terminology.
        Return ONLY the rewritten query, nothing else.
        """},
        *chat_history[-6:],  # Last 6 messages for context
        {"role": "user", "content": query},
    ]
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # Cheap model works fine for this
        messages=messages,
        temperature=0.1,
    )
    return response.choices[0].message.content


# Before: "How do I fix it?"
# After: "How to fix the 'context window full' error in Claude Code"

Query Decomposition splits complex multi-part questions:

import json


def decompose_query(query: str) -> list[str]:
    """Break a multi-part query into individual retrieval sub-queries."""
    client = OpenAI()
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "system",
            # JSON mode returns an object, so ask for the array under a "queries" key
            "content": (
                "Break this complex question into 2-5 simpler sub-questions. "
                'Return a JSON object of the form {"queries": ["...", "..."]}.'
            ),
        }, {
            "role": "user",
            "content": query
        }],
        response_format={"type": "json_object"},
    )
    sub_queries = json.loads(response.choices[0].message.content)["queries"]
    return sub_queries


# Input: "Compare LangChain and LlamaIndex for RAG with Pinecone, including costs and scaling"
# Output: ["What is LangChain's RAG architecture with Pinecone?", "What is LlamaIndex's RAG architecture with Pinecone?", ...]
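
Once a query is decomposed, each sub-query still has to be retrieved separately and the results merged. The sketch below shows one simple way to do that: keep each document's best score across sub-queries. The names `retrieve_for_sub_queries` and `fake_retrieve` are hypothetical, and the scores are made up for illustration; swap in your real retriever.

```python
from typing import Callable


def retrieve_for_sub_queries(
    sub_queries: list[str],
    retrieve: Callable[[str], list[tuple[str, float]]],
    top_k: int = 5,
) -> list[tuple[str, float]]:
    """Run retrieval once per sub-query and merge, keeping each document's best score."""
    best: dict[str, float] = {}
    for sub_query in sub_queries:
        for doc, score in retrieve(sub_query):
            if score > best.get(doc, float("-inf")):
                best[doc] = score
    merged = sorted(best.items(), key=lambda item: item[1], reverse=True)
    return merged[:top_k]


# Hypothetical stand-in for a real retriever, with made-up scores.
def fake_retrieve(query: str) -> list[tuple[str, float]]:
    corpus = {
        "langchain": [("LangChain RAG guide", 0.9), ("Pinecone pricing", 0.4)],
        "llamaindex": [("LlamaIndex RAG guide", 0.8), ("Pinecone pricing", 0.6)],
    }
    key = "langchain" if "LangChain" in query else "llamaindex"
    return corpus[key]


merged = retrieve_for_sub_queries(
    ["What is LangChain's RAG architecture?", "What is LlamaIndex's RAG architecture?"],
    fake_retrieve,
)
print(merged)
```

Taking the maximum score is only one merge policy. Summing scores would instead favor documents that are relevant to several sub-queries at once.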

2. Hybrid Search: Dense + Sparse

Vector search alone misses exact keyword matches. Keyword search alone misses semantic matches. Hybrid search combines both.

from langchain_community.retrievers import BM25Retriever  # requires the rank_bm25 package


class HybridRetriever:
    def __init__(self, vector_store, documents):
        self.vector_store = vector_store
        self.bm25_retriever = BM25Retriever.from_documents(documents)
        self.bm25_retriever.k = 10

    def retrieve(self, query: str, alpha: float = 0.5, top_k: int = 5):
        """
        Combines dense (vector) and sparse (BM25) retrieval.
        alpha = 0 → pure BM25, alpha = 1 → pure vector search.
        Returns a list of (document_text, combined_score) tuples.
        """
        # Dense retrieval
        dense_results = self.vector_store.similarity_search_with_relevance_scores(
            query, k=10
        )
        dense_scores = {doc.page_content: score for doc, score in dense_results}

        # Sparse retrieval (BM25). The retriever returns ranked documents without
        # scores, so convert rank to a 0-1 score (best rank → 1.0).
        sparse_results = self.bm25_retriever.invoke(query)
        max_sparse = len(sparse_results)
        sparse_scores = {}
        for i, doc in enumerate(sparse_results):
            sparse_scores[doc.page_content] = 1 - (i / max_sparse)

        # Combine scores; a document missing from one list scores 0 there
        all_docs = set(dense_scores) | set(sparse_scores)
        combined = []
        for doc in all_docs:
            dense_score = dense_scores.get(doc, 0)
            sparse_score = sparse_scores.get(doc, 0)
            combined_score = alpha * dense_score + (1 - alpha) * sparse_score
            combined.append((doc, combined_score))

        # Sort and return top-k
        combined.sort(key=lambda x: x[1], reverse=True)
        return combined[:top_k]

When to adjust alpha:

- Technical code queries → α = 0.3 (BM25 weighted higher; exact variable names matter)
- Conceptual questions → α = 0.7 (semantic matching matters more)
- Product names/APIs → α = 0.2 (keywords critical)
- Default starting point → α = 0.5
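
To see why alpha matters, it helps to run the weighted sum on a toy case. The sketch below uses a hypothetical `fuse` helper and made-up, already normalized scores for two documents. It applies the same formula as the retriever above and shows the top result flipping as alpha moves from keyword-heavy to semantic-heavy.

```python
def fuse(dense: dict[str, float], sparse: dict[str, float], alpha: float) -> list[tuple[str, float]]:
    """Weighted sum of dense and sparse scores; a missing score counts as 0."""
    docs = set(dense) | set(sparse)
    scored = [
        (doc, alpha * dense.get(doc, 0.0) + (1 - alpha) * sparse.get(doc, 0.0))
        for doc in docs
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Toy scores in the 0-1 range, invented for illustration
dense = {"conceptual overview": 0.9, "api reference": 0.5}
sparse = {"api reference": 1.0, "conceptual overview": 0.2}

for alpha in (0.2, 0.5, 0.7):
    top_doc, top_score = fuse(dense, sparse, alpha)[0]
    print(f"alpha={alpha}: {top_doc} ({top_score:.2f})")
```

With these numbers the API reference wins at α = 0.2 and α = 0.5, while the conceptual overview takes over at α = 0.7. In practice the right value is best found by sweeping alpha against a labeled query set rather than by rule of thumb alone.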

3. Cross-Encoder Reranking

The biggest single improvement you can make to RAG quality is reranking. A cross-encoder reads query + document together (not independently like bi-encoders), giving much more accurate relevance scores.

from sentence_transformers import CrossEncoder
import torch


class Reranker:
    def __init__(self, model_name: str = "BAAI/bge-reranker-v2-m3"):
        self.model = CrossEncoder(model_name)
        # Common reranker options:
        # - MS-MARCO cross-encoders (web/docs)
        # - BGE-Reranker (balanced)
        # - Cohere Rerank (API-based)

    def rerank(self, query: str, documents: list, top_k: int = 3, apply_threshold: float = 0.0):
        """Rerank documents using cross-encoder. Returns (document, score) tuples."""
        pairs = [[query, doc] for doc in documents]
        with torch.no_grad():
            scores = self.model.predict(pairs)

        # Sort by score descending
        scored_docs = list(zip(documents, scores))
        scored_docs.sort(key=lambda x: x[1], reverse=True)

        # Apply threshold filter
        filtered = [(doc, score) for doc, score in scored_docs if score >= apply_threshold]
        return filtered[:top_k]

Reranking impact (benchmark):

Retrieval Method	Top-5 Accuracy	Top-3 Accuracy	Top-1 Accuracy
Dense only	78%	65%	42%
Hybrid (dense + BM25)	86%	74%	51%
Hybrid + Reranking	94%	88%	73%

Reranking typically improves top-1 accuracy by 20-25 percentage points. In the table above, adding it to hybrid search lifts top-1 accuracy from 51% to 73%.
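
To reproduce figures like these on your own data, you need a small labeled set of queries with known relevant documents. The sketch below assumes one relevant document per query and measures top-k accuracy as hit rate. That is one common definition, not necessarily the one behind the benchmark above. The function name and document IDs are hypothetical.

```python
def top_k_accuracy(ranked_results: list[list[str]], relevant: list[str], k: int) -> float:
    """Fraction of queries whose relevant document appears in the first k results."""
    hits = sum(1 for ranked, gold in zip(ranked_results, relevant) if gold in ranked[:k])
    return hits / len(relevant)


# One ranked result list per query (made-up document IDs)
ranked_results = [
    ["d1", "d7", "d3"],
    ["d9", "d2", "d4"],
    ["d5", "d6", "d8"],
    ["d4", "d1", "d2"],
]
# The single relevant document for each query
relevant = ["d1", "d2", "d8", "d9"]

for k in (1, 3):
    print(f"top-{k} accuracy: {top_k_accuracy(ranked_results, relevant, k):.0%}")
```

Running the same function before and after adding the reranker gives a direct measure of how much it helps on your corpus.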

4. Graph RAG: Understanding Relationships

Standard RAG treats documents as independent chunks. Graph RAG captures relationships between entities, enabling multi-hop reasoning.

from langchain_graph_rag import GraphRAG

graph_rag = GraphRAG(
    # Extract entities and relationships from documents
    entity_extractor="gpt-4o-mini",
    graph_store="neo4j",  # or "memgraph", "kuzu"
    # Indexing configuration
    relationship_extraction={
        "types": ["DEPENDS_ON", "USES", "ALTERNATIVE_TO", "PART_OF"],
        "max_depth": 3,
    },
    # Retrieval
    hybrid_search=True,
    reranker=True,
    community_detection=True,  # Group related entities
)
# Index documents
graph_rag.add_documents(documents)

# Query with multi-hop reasoning
result = graph_rag.query("Which vector databases support hybrid search and have Python SDKs?")
# Without Graph RAG: each chunk answers a single-fact question, e.g. "Does Pinecone support hybrid search?"
# With Graph RAG: "Pinecone → supports → hybrid search, Pinecone → has → Python SDK"
#                 "Weaviate → supports → hybrid search, Weaviate → has → Python SDK"

Graph RAG excels at questions involving:

- Comparisons ("What's better, X or Y for my use case?")
- Causal chains ("Why does Z happen when I use X?")
- Multi-step ("Find tools that do A and B but not C")
- Trade-offs ("What do I give up by choosing X over Y?")
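
The core idea of combining facts across edges can be shown without a graph database. The sketch below stores the relationships from the example above as plain triples and answers the "supports hybrid search and has a Python SDK" question by intersecting edge lookups. The `entities_matching` helper and the `ExampleDB` entity are hypothetical. A real system would extract the triples with an LLM and store them in a graph store.

```python
from collections import defaultdict

# Toy triples mirroring the example above (illustrative, not a product survey)
triples = [
    ("Pinecone", "SUPPORTS", "hybrid search"),
    ("Pinecone", "HAS", "Python SDK"),
    ("Weaviate", "SUPPORTS", "hybrid search"),
    ("Weaviate", "HAS", "Python SDK"),
    ("ExampleDB", "HAS", "Python SDK"),  # hypothetical entity that fails one condition
]

# Index: (relation, object) -> set of subjects
index: dict[tuple[str, str], set[str]] = defaultdict(set)
for subject, relation, obj in triples:
    index[(relation, obj)].add(subject)


def entities_matching(*conditions: tuple[str, str]) -> set[str]:
    """Return entities that satisfy every (relation, object) condition."""
    matches = [index.get(condition, set()) for condition in conditions]
    return set.intersection(*matches) if matches else set()


result = entities_matching(("SUPPORTS", "hybrid search"), ("HAS", "Python SDK"))
print(sorted(result))
```

Chunk-based retrieval would need a single passage stating both facts about each database. The graph lookup only needs each fact to appear somewhere in the corpus.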

5. RAPTOR: Recursive Abstractive Processing for Tree-Organized Retrieval

RAPTOR creates a hierarchical summary tree. Documents are chunked, chunks are summarized, summaries are summarized again — creating layers of abstraction.

Level 0: Raw chunks (1000s of small pieces)
    ↓
Level 1: Topic clusters (100s of medium pieces)
    ↓
Level 2: Section summaries (10s of summaries)
    ↓
Level 3: Document summary (1 executive summary)

During retrieval, RAPTOR searches all levels and picks the most relevant. A question like "What's the main finding of this 200-page report?" matches Level 3 — it doesn't need to search a million chunks.

from raptor import RAPTOR

raptor = RAPTOR(
    embed_model="text-embedding-3-large",
    llm_model="gpt-4o-mini",  # For summarization
    clustering="gmm",  # Gaussian Mixture for better topic clustering
    levels=4,
)
tree = raptor.index_documents(documents)
results = raptor.retrieve("What are the key conclusions?", levels=[2, 3])
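
To make the tree structure concrete, here is a dependency-free sketch of the same idea. It makes several simplifying assumptions: consecutive chunks are grouped in pairs instead of being clustered by embedding, `summarize` is a placeholder that truncates text where a real pipeline would call an LLM, and relevance is plain word overlap. All function names are hypothetical.

```python
def summarize(texts: list[str]) -> str:
    """Placeholder summarizer: keep the first 8 words of each text.
    A real pipeline would call an LLM here."""
    return " ".join(" ".join(text.split()[:8]) for text in texts)


def build_tree(chunks: list[str], group_size: int = 2) -> list[list[str]]:
    """Return levels where levels[0] holds raw chunks and levels[-1] a single root."""
    if group_size < 2:
        raise ValueError("group_size must be at least 2")
    levels = [chunks]
    while len(levels[-1]) > 1:
        current = levels[-1]
        groups = [current[i:i + group_size] for i in range(0, len(current), group_size)]
        levels.append([summarize(group) for group in groups])
    return levels


def retrieve(levels: list[list[str]], query: str, top_k: int = 2) -> list[tuple[int, str]]:
    """Score every node at every level by word overlap and return the best (level, text) pairs."""
    query_words = set(query.lower().split())
    scored = []
    for level, nodes in enumerate(levels):
        for node in nodes:
            overlap = len(query_words & set(node.lower().split()))
            scored.append((overlap, level, node))
    scored.sort(key=lambda item: item[0], reverse=True)
    return [(level, node) for _, level, node in scored[:top_k]]


chunks = [
    "Revenue grew in the first quarter",
    "Costs fell after the migration",
    "Hiring slowed in the second half",
    "Customer churn stayed flat all year",
]
levels = build_tree(chunks)
print([len(level) for level in levels])
print(retrieve(levels, "customer churn"))
```

Searching all levels in one pass lets a narrow question land on a raw chunk while a broad one can match a higher-level summary.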

6. Evaluation: Don't Deploy Without It

The #1 RAG mistake? Deploying without evaluation. Production RAG needs continuous monitoring:

from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
)

# Create test set
test_set = [
    {
        "question": "How do I set up Claude Code with DeepSeek?",
        "answer": "...",       # Generated by your RAG system
        "contexts": ["..."],   # Retrieved documents
        "ground_truth": "...", # Ideal answer
    }
]

# evaluate() expects a Hugging Face Dataset rather than a plain list.
# Column names follow the ragas 0.1-style schema; newer releases may rename them.
dataset = Dataset.from_list(test_set)

# Run evaluation
results = evaluate(
    dataset=dataset,
    metrics=[
        faithfulness,        # Does the answer stay true to the context?
        answer_relevancy,    # Is the answer relevant to the question?
        context_precision,   # Are the retrieved documents all relevant?
        context_recall,      # Are all relevant documents retrieved?
    ],
)

print(f"""
Faithfulness:      {results['faithfulness']:.2%}
Answer Relevancy:  {results['answer_relevancy']:.2%}
Context Precision: {results['context_precision']:.2%}
Context Recall:    {results['context_recall']:.2%}
""")

Production targets:

Metric	Good	Great	Excellent
Faithfulness	>85%	>92%	>97%
Answer Relevancy	>80%	>90%	>95%
Context Precision	>75%	>85%	>92%
Context Recall	>70%	>82%	>90%
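
These targets are easiest to enforce when they run as an automated check, for example in CI. The sketch below encodes the table's thresholds and flags any metric that falls short of the "Good" column. The `grade` and `failing_metrics` helpers and the sample scores are hypothetical. Feed in the values from your own evaluation run.

```python
# Thresholds copied from the table above: (good, great, excellent)
TARGETS = {
    "faithfulness": (0.85, 0.92, 0.97),
    "answer_relevancy": (0.80, 0.90, 0.95),
    "context_precision": (0.75, 0.85, 0.92),
    "context_recall": (0.70, 0.82, 0.90),
}


def grade(metric: str, score: float) -> str:
    """Map a 0-1 score to the tier it clears in the targets table."""
    good, great, excellent = TARGETS[metric]
    if score > excellent:
        return "excellent"
    if score > great:
        return "great"
    if score > good:
        return "good"
    return "below target"


def failing_metrics(scores: dict[str, float]) -> list[str]:
    """List metrics that do not clear the 'good' threshold."""
    return [metric for metric, score in scores.items() if grade(metric, score) == "below target"]


# Made-up scores for illustration
scores = {
    "faithfulness": 0.93,
    "answer_relevancy": 0.88,
    "context_precision": 0.95,
    "context_recall": 0.65,
}
for metric, score in scores.items():
    print(f"{metric}: {score:.0%} ({grade(metric, score)})")
print("Failing:", failing_metrics(scores))
```

Failing the build when `failing_metrics` is non-empty keeps a retrieval regression from reaching production unnoticed.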

7. Putting It All Together: Production Pipeline

Here's what a complete production RAG pipeline looks like in 2026:

from langchain_openai import ChatOpenAI


class ProductionRAGPipeline:
    def __init__(self, vector_store, documents):
        # Components from the earlier sections
        self.hybrid_retriever = HybridRetriever(vector_store, documents)
        self.reranker = Reranker()
        self.graph_rag = GraphRAG(...)
        self.llm = ChatOpenAI(model="gpt-4o")

    def _build_context(self, reranked, graph_context) -> str:
        """Minimal sketch: join reranked chunks and graph context into one prompt context."""
        chunks = [doc for doc, _ in reranked]
        return "\n\n".join([*chunks, str(graph_context)])

    def query(self, user_query: str, chat_history: list | None = None):
        # 1. Rewrite query for optimal retrieval
        rewritten = rewrite_query(user_query, chat_history or [])

        # 2. Hybrid search (dense + sparse)
        initial_results = self.hybrid_retriever.retrieve(rewritten, alpha=0.5, top_k=10)

        # 3. Graph-aware expansion
        graph_context = self.graph_rag.query(rewritten)

        # 4. Rerank with cross-encoder
        doc_texts = [doc for doc, _ in initial_results]
        reranked = self.reranker.rerank(rewritten, doc_texts, top_k=4)

        # 5. Generate response
        context = self._build_context(reranked, graph_context)
        response = self.llm.invoke(f"Context:\n{context}\n\nQuestion: {user_query}")
        return response

Upgrade Path

Where to invest based on your current RAG quality:

If your accuracy is...	Invest in...	Expected gain
<60%	Better chunking + reranking	+25%
60-75%	Hybrid search + query rewriting	+15%
75-85%	Cross-encoder reranking	+10%
85-92%	Graph RAG + entity extraction	+7%
92-97%	RAPTOR hierarchical retrieval	+3%
>97%	You're done. Focus on latency and cost

Quick Start

pip install openai langchain langchain-community langchain-openai rank_bm25 pinecone-client sentence-transformers ragas datasets

The biggest bang for your buck in 2026: hybrid search + cross-encoder reranking. Those two techniques alone will get most RAG systems above 90% retrieval accuracy. Graph RAG and RAPTOR are for the last few percentage points — only invest if your baseline is already solid.