
Here's a scenario that plays out constantly in production RAG systems: your legal document assistant handles broad, conceptual questions beautifully — "What are the key risks in this contract?" — but falls apart the moment someone asks for something precise: "What does clause 14.3(b) say about indemnification?" The dense retrieval model, trained to understand semantic similarity, has no idea that 14.3(b) is critically specific. It returns vaguely relevant chunks about indemnification generally and misses the exact clause entirely.
The reverse problem is just as common. Systems built on keyword search — BM25 or similar sparse methods — handle exact match queries with precision but completely fail when users ask things like "what are the force majeure provisions?" using language that doesn't appear verbatim in the document. Users shouldn't have to know which retrieval paradigm their query fits before they ask a question.
Hybrid search solves this by combining dense vector retrieval (semantic understanding) with sparse keyword retrieval (exact match precision) into a unified pipeline. By the end of this lesson, you'll have a complete, production-ready hybrid RAG system that outperforms either approach alone — and you'll understand why the fusion works so you can tune it intelligently rather than treating it as a black box.
What you'll learn:
You should be comfortable with:
Familiarity with langchain or similar LLM frameworks is helpful but not required
We'll use OpenAI embeddings, Qdrant as the vector store, and rank_bm25 for sparse retrieval. You'll need an OpenAI API key and a Python environment with the following packages installed:
pip install openai qdrant-client rank-bm25 nltk numpy pandas tiktoken
Dense retrieval works by encoding both your documents and queries into high-dimensional vectors, then finding the closest vectors by cosine or dot-product similarity. Models like text-embedding-3-small or all-MiniLM-L6-v2 are trained on massive corpora to place semantically similar content near each other in this vector space.
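To make the similarity step concrete, here is a minimal sketch of cosine similarity ranking. The three-dimensional vectors and chunk names are invented for illustration; real embeddings come from the model and have hundreds or thousands of dimensions.

```python
import math
from typing import List

def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors, hand-written for illustration only (not real embeddings)
query_vec = [0.9, 0.1, 0.0]
chunk_vecs = {
    "chunk_about_collisions": [0.8, 0.2, 0.1],
    "chunk_about_invoices": [0.1, 0.1, 0.9],
}

# Rank chunks by similarity to the query, most similar first
ranked = sorted(
    chunk_vecs,
    key=lambda name: cosine_similarity(query_vec, chunk_vecs[name]),
    reverse=True,
)
print(ranked)
```

A vector store such as Qdrant performs the same comparison, but with an approximate nearest-neighbor index so it does not have to score every stored vector.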
This is genuinely powerful. "car accident" and "vehicle collision" land near each other. "How do I terminate this agreement?" retrieves relevant clauses even if they use words like "discontinue" or "dissolve." Dense retrieval handles paraphrase, synonymy, and conceptual similarity extremely well.
But it has a structural weakness: it compresses meaning into a fixed-size vector. A 1536-dimensional vector representing a 500-word chunk simply cannot preserve every specific detail in that chunk with full fidelity. When a query contains a highly specific term — a product SKU, a legal clause reference, a person's name, a medical code, a precise numerical threshold — that specificity can be washed out by the broader semantic signal.
Consider these two document chunks:
These are semantically very similar — they'd likely land close together in vector space. But if you ask "what is the royalty rate?", returning the wrong chunk gives the LLM wrong information to reason from. The number is the entire point.
BM25 (Best Match 25) is the gold standard of sparse retrieval. It's a probabilistic relevance model that ranks documents based on term frequency, inverse document frequency, and document length normalization. If your query shares vocabulary with a document, BM25 finds it efficiently and precisely.
Here's the problem: BM25 is lexically rigid. It has no concept of meaning, only of matching tokens. "Agreement termination" and "contract cancellation" share zero tokens, so a query using one phrase scores zero against a document that uses only the other. This is vocabulary mismatch, and it's pervasive in real-world usage. Users don't consult a controlled vocabulary before asking questions.
BM25 is also brittle with multilingual content, domain-specific abbreviations that vary by author, and any query phrased at a level of abstraction above the document's specific language.
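The vocabulary mismatch described above can be reproduced with a small, self-contained BM25 scorer. This is a sketch, not the rank_bm25 implementation: it uses a common non-negative IDF variant, the usual defaults k1=1.5 and b=0.75 are assumptions, and the two-document corpus is invented for illustration.

```python
import math
from collections import Counter
from typing import List

def bm25_scores(query: List[str], corpus: List[List[str]],
                k1: float = 1.5, b: float = 0.75) -> List[float]:
    """Score every tokenized document in `corpus` against the tokenized `query`."""
    n_docs = len(corpus)
    avg_len = sum(len(doc) for doc in corpus) / n_docs
    # Document frequency: in how many documents each term appears
    doc_freq = Counter(term for doc in corpus for term in set(doc))
    scores = []
    for doc in corpus:
        term_freq = Counter(doc)
        score = 0.0
        for term in query:
            if term not in term_freq:
                continue  # no lexical overlap means no contribution at all
            # Non-negative IDF variant (the "1 +" inside the log)
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            tf = term_freq[term]
            length_norm = 1 - b + b * len(doc) / avg_len
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score)
    return scores

# Invented two-document corpus
corpus = [
    "either party may request contract cancellation".split(),
    "invoices are due within thirty days".split(),
]
mismatch = bm25_scores("agreement termination".split(), corpus)
match = bm25_scores("contract cancellation".split(), corpus)
print(mismatch)
print(match)
```

The first query means the same thing as the first document but shares no tokens with it, so every score is zero. Only the second query, which reuses the document's exact words, scores above zero.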
So we're left with a clear trade-off: dense retrieval excels at semantic generalization but loses precision on specifics; sparse retrieval excels at exact matching but fails on semantic variation. The practical question is how to combine them.
Hybrid search runs your query through both retrieval systems independently, gets two ranked lists of results, and then fuses them into a single ranked list. The fusion step is where the interesting engineering lives.
The naive approach — averaging normalized relevance scores — is fragile. BM25 scores and cosine similarity scores live on completely different scales and have different distributions depending on the corpus. Normalizing them introduces its own assumptions and is sensitive to outliers.
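A small sketch shows how sensitive min-max normalization is to a single outlier. The BM25-style scores below are made up for illustration.

```python
from typing import List

def min_max_normalize(scores: List[float]) -> List[float]:
    """Rescale scores linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]

# Invented score lists: the second adds one unusually strong match
typical = [12.0, 11.0, 10.0, 2.0]
with_outlier = [60.0, 12.0, 11.0, 10.0, 2.0]

normalized_typical = min_max_normalize(typical)
normalized_outlier = min_max_normalize(with_outlier)
print([round(s, 2) for s in normalized_typical])
print([round(s, 2) for s in normalized_outlier])
```

In the first list the three good matches stay close to 1.0. In the second, the single outlier pushes those same three scores below 0.2, so their contribution to an averaged score collapses even though their ranks barely changed.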
Reciprocal Rank Fusion (RRF) sidesteps this entirely by ignoring raw scores and working only with ranks. The formula for a document's RRF score is:
RRF(d) = Σ_i 1 / (k + rank_i(d))
Here rank_i(d) is the document's 1-based rank in retrieval system i, the sum runs over every system that returned the document, and k is a smoothing constant (typically 60). The constant k prevents the top-ranked documents from having an overwhelming advantage and makes the ranking robust to noise at the top of each list.
The intuition: a document that appears at rank 1 in both systems gets a score of 1/(60+1) + 1/(60+1) ≈ 0.033. A document appearing at rank 1 in one system and rank 100 in the other gets 1/61 + 1/160 ≈ 0.023. A document at rank 5 in both gets 1/65 + 1/65 ≈ 0.031, nearly as high as the first example and well above the second. RRF is designed to reward consistent presence across systems rather than dominance in just one.
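The arithmetic can be checked with a few lines of Python. The helper name rrf_score is introduced here for illustration only.

```python
from typing import Iterable

def rrf_score(ranks: Iterable[int], k: int = 60) -> float:
    """Sum 1 / (k + rank) over the 1-based ranks a document received."""
    return sum(1.0 / (k + rank) for rank in ranks)

examples = {
    "rank 1 in both": (1, 1),
    "rank 1 and rank 100": (1, 100),
    "rank 5 in both": (5, 5),
}
for label, ranks in examples.items():
    print(f"{label}: {rrf_score(ranks):.4f}")
```

The document at rank 5 in both lists comfortably outscores the one that tops a single list and sits at rank 100 in the other.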
Alternatively, you can use a weighted linear combination if you want to explicitly favor one retrieval method for your particular domain. We'll implement both approaches and show you when each is appropriate.
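One possible shape for the weighted linear combination is sketched below. It assumes min-max normalization per system and a single mixing parameter alpha (1.0 means dense only, 0.0 means sparse only). The function names, the treatment of missing documents as zero, and the toy scores are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple

def _min_max(scores: Dict[int, float]) -> Dict[int, float]:
    """Rescale one system's scores to [0, 1]; if all scores tie, each maps to 1.0."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}

def weighted_score_fusion(
    dense_scores: Dict[int, float],
    sparse_scores: Dict[int, float],
    alpha: float = 0.5,
) -> List[Tuple[int, float]]:
    """Blend normalized scores: alpha * dense + (1 - alpha) * sparse."""
    dense_n = _min_max(dense_scores)
    sparse_n = _min_max(sparse_scores)
    fused = {
        # A document missing from one system contributes 0 from that system
        doc_id: alpha * dense_n.get(doc_id, 0.0) + (1 - alpha) * sparse_n.get(doc_id, 0.0)
        for doc_id in set(dense_n) | set(sparse_n)
    }
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)

dense_scores = {0: 0.82, 1: 0.79, 2: 0.40}   # cosine similarities (toy values)
sparse_scores = {1: 7.5, 3: 6.0, 2: 1.5}     # BM25 scores (toy values)

balanced = weighted_score_fusion(dense_scores, sparse_scores, alpha=0.5)
dense_only = weighted_score_fusion(dense_scores, sparse_scores, alpha=1.0)
print([doc_id for doc_id, _ in balanced])
```

With equal weighting, chunk 1 wins because it scores well in both systems. With alpha=1.0 the result reduces to the dense ranking and chunk 0 comes first. Unlike RRF, this approach keeps score magnitudes, which is useful when one system's confidence gaps carry real information, but it inherits the normalization fragility described earlier.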
Let's build this end-to-end using a realistic scenario: a RAG system over a set of software vendor contracts. This is exactly the kind of corpus that breaks pure dense or pure sparse retrieval — it contains both precise legal references (clause numbers, defined terms, exact monetary thresholds) and conceptual content (risk allocation, governing law philosophy, SLA expectations).
First, let's set up our document processing. We'll chunk documents with moderate overlap to preserve context across chunk boundaries.
import os
import re
import json
from typing import List, Dict, Tuple
import numpy as np
from openai import OpenAI
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
def chunk_document(text: str, chunk_size: int = 400, overlap: int = 80) -> List[Dict]:
    """
    Split a document into overlapping chunks.
    Returns list of dicts with 'chunk_id', 'text', and 'word_count'.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    # Simple word-based splitting; for production, use tiktoken for token-accurate chunking
    words = text.split()
    chunks = []
    start = 0
    chunk_id = 0
    while start < len(words):
        end = min(start + chunk_size, len(words))
        chunk_text = " ".join(words[start:end])
        chunks.append({
            "chunk_id": chunk_id,
            "text": chunk_text,
            "word_count": end - start
        })
        if end == len(words):
            break  # the final chunk already reaches the end of the document
        chunk_id += 1
        start += chunk_size - overlap  # advance with overlap
    return chunks
# Sample contracts — in production you'd load these from PDFs or a document store
contracts = {
"vendor_acme_2024": """
Master Service Agreement between Acme Corp and TechVendor Inc.
Effective Date: January 15, 2024.
Section 1. Services. TechVendor shall provide cloud infrastructure services
as described in Schedule A, including 99.9% uptime SLA measured monthly.
Section 4.2. Payment Terms. Invoices are due net-30. Late payments accrue
interest at 1.5% per month. Annual contract value is $240,000.
Section 7. Indemnification. Each party shall indemnify and hold harmless
the other from third-party claims arising from that party's gross negligence
or willful misconduct. Indemnification obligations survive termination.
Section 9.1. Limitation of Liability. Neither party's aggregate liability
shall exceed the total fees paid in the twelve months preceding the claim.
Cap on damages excludes indemnification obligations under Section 7.
Section 14.3(b). Data Processing. Vendor shall process personal data only
on documented instructions from Customer. Vendor implements technical and
organizational measures meeting ISO 27001 standards. Data deletion
within 30 days of contract termination.
Section 18. Termination for Convenience. Either party may terminate
with 90 days written notice. Customer owes fees through termination date
plus a 15% early termination fee if within first 12 months.
""",
"vendor_globalsys_2024": """
Software License and Services Agreement with GlobalSys Solutions.
Start Date: March 1, 2024. Term: 36 months.
Article 2. License Grant. GlobalSys grants a non-exclusive, non-transferable
license for internal business operations only. No sublicensing permitted
without prior written consent.
Article 5. Fees and Payment. Annual license fee of $180,000, payable quarterly.
Price increases capped at CPI plus 3% annually.
Article 8. Service Levels. Vendor guarantees 99.95% monthly uptime for
production systems. Downtime credits: 10% of monthly fee per hour beyond SLA.
Credits do not exceed 30% of monthly fee.
Article 11. Intellectual Property. All customizations developed under
this agreement are work-for-hire and vest in Customer upon payment.
Pre-existing IP remains with originating party.
Article 15. Governing Law. Agreement governed by laws of Delaware.
Disputes resolved through binding arbitration under AAA Commercial Rules.
Arbitration seat: New York, NY.
"""
}
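# Process all contracts into chunks.
# The sample contracts are well under 400 words each, so the default chunk_size
# would yield a single chunk per contract; smaller chunks give retrieval something to rank.
all_chunks = []
for doc_id, text in contracts.items():
    chunks = chunk_document(text, chunk_size=60, overlap=15)
    for chunk in chunks:
        chunk["doc_id"] = doc_id
        all_chunks.append(chunk)
print(f"Total chunks created: {len(all_chunks)}")
for i, chunk in enumerate(all_chunks):
    print(f"Chunk {i}: {chunk['text'][:80]}...")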
Now we embed all chunks and store them in Qdrant. We'll store the chunk text in the payload so we can retrieve it without a separate lookup.
from qdrant_client import QdrantClient
from qdrant_client.models import (
    Distance, VectorParams, PointStruct, Filter, FieldCondition, MatchValue
)
# Initialize Qdrant in-memory for this example
# For production: QdrantClient(host="localhost", port=6333) or cloud URL
qdrant = QdrantClient(":memory:")
COLLECTION_NAME = "contracts_hybrid"
EMBEDDING_MODEL = "text-embedding-3-small"
VECTOR_DIM = 1536
def get_embeddings(texts: List[str], batch_size: int = 100) -> List[List[float]]:
    """Embed a list of texts using OpenAI API with batching."""
    all_embeddings = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        response = client.embeddings.create(
            model=EMBEDDING_MODEL,
            input=batch
        )
        batch_embeddings = [item.embedding for item in response.data]
        all_embeddings.extend(batch_embeddings)
    return all_embeddings
# Create collection
qdrant.create_collection(
    collection_name=COLLECTION_NAME,
    vectors_config=VectorParams(size=VECTOR_DIM, distance=Distance.COSINE)
)
# Embed all chunks
print("Embedding chunks...")
texts = [chunk["text"] for chunk in all_chunks]
embeddings = get_embeddings(texts)
# Upload to Qdrant with full metadata in payload
points = [
    PointStruct(
        id=i,
        vector=embeddings[i],
        payload={
            "text": all_chunks[i]["text"],
            "doc_id": all_chunks[i]["doc_id"],
            "chunk_id": all_chunks[i]["chunk_id"]
        }
    )
    for i in range(len(all_chunks))
]
qdrant.upsert(collection_name=COLLECTION_NAME, points=points)
print(f"Uploaded {len(points)} vectors to Qdrant")
For BM25, we tokenize our chunks and build an in-memory index. Note the preprocessing pipeline — lowercasing, removing punctuation, and basic stopword removal all affect retrieval quality.
from rank_bm25 import BM25Okapi
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
nltk.download('punkt', quiet=True)
nltk.download('punkt_tab', quiet=True)
nltk.download('stopwords', quiet=True)
STOP_WORDS = set(stopwords.words('english'))
def tokenize_for_bm25(text: str) -> List[str]:
    """
    Tokenize text for BM25 indexing.
    We keep numbers and legal terms, since removing them would hurt precision
    on exactly the cases where BM25 shines.
    """
    tokens = word_tokenize(text.lower())
    # Drop pure-punctuation tokens, but keep anything containing a letter or digit.
    # A plain t.isalnum() check would also discard "14.3", "99.9", and "net-30".
    tokens = [t for t in tokens if any(ch.isalnum() for ch in t)]
    # Light stopword removal: tokens of one or two characters are always kept,
    # so short identifiers such as clause letters survive
    tokens = [t for t in tokens if t not in STOP_WORDS or len(t) <= 2]
    return tokens
# Build BM25 index over all chunks
tokenized_corpus = [tokenize_for_bm25(chunk["text"]) for chunk in all_chunks]
bm25_index = BM25Okapi(tokenized_corpus)
print(f"BM25 index built over {len(tokenized_corpus)} documents")
print(f"Sample tokens from chunk 0: {tokenized_corpus[0][:20]}")
Warning: BM25 tokenization choices significantly affect retrieval quality. If you strip too aggressively (removing numbers, short tokens), you defeat the purpose of having BM25 — which is precisely to catch specific terms. If you don't normalize enough, you'll miss matches due to case differences or punctuation variations.
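The effect of tokenizer choices is easy to see on a single query. The sketch below uses two regex tokenizers as stand-ins (not the NLTK pipeline above) so it runs without downloads; both function names and the sample query are invented for illustration.

```python
import re
from typing import List

def aggressive_tokens(text: str) -> List[str]:
    """Keep purely alphabetic words only; every number is discarded."""
    return re.findall(r"[a-z]+", text.lower())

def conservative_tokens(text: str) -> List[str]:
    """Keep words and numbers, including decimals such as 14.3 or 99.9."""
    return re.findall(r"\d+(?:\.\d+)+|[a-z0-9]+", text.lower())

query = "Section 14.3(b) requires 99.9% uptime"
print(aggressive_tokens(query))
print(conservative_tokens(query))
```

The aggressive tokenizer leaves only section, b, requires, and uptime, so the query can no longer distinguish clause 14.3(b) from any other section. The conservative one keeps 14.3 and 99.9 as single tokens, which is what lets BM25 match the exact clause.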
This is the core of hybrid search. The function takes results from both systems and produces a unified ranking.
def reciprocal_rank_fusion(
    dense_results: List[Dict],
    sparse_results: List[Dict],
    k: int = 60,
    dense_weight: float = 1.0,
    sparse_weight: float = 1.0
) -> List[Dict]:
    """
    Merge two ranked result lists using Reciprocal Rank Fusion.
    Args:
        dense_results: List of dicts with 'id', 'text', 'score', 'doc_id'
        sparse_results: Same format
        k: RRF smoothing constant (default 60 is standard)
        dense_weight: Multiplier for dense retrieval contribution
        sparse_weight: Multiplier for sparse retrieval contribution
    Returns:
        Merged and re-ranked list of result dicts with 'rrf_score'
    """
    # Build a unified document registry
    doc_registry = {}
    # Both systems are processed the same way; only the rank field and weight differ
    systems = [
        ("dense_rank", dense_results, dense_weight),
        ("sparse_rank", sparse_results, sparse_weight),
    ]
    for rank_key, results, weight in systems:
        for rank, result in enumerate(results, start=1):
            doc_id = result["id"]
            if doc_id not in doc_registry:
                doc_registry[doc_id] = {
                    "id": doc_id,
                    "text": result["text"],
                    "doc_id": result.get("doc_id", ""),
                    "rrf_score": 0.0,
                    "dense_rank": None,
                    "sparse_rank": None
                }
            doc_registry[doc_id]["rrf_score"] += weight * (1.0 / (k + rank))
            doc_registry[doc_id][rank_key] = rank
    # Sort by RRF score descending
    merged = sorted(doc_registry.values(), key=lambda x: x["rrf_score"], reverse=True)
    return merged
def dense_search(query: str, top_k: int = 10) -> List[Dict]:
    """Run dense vector search against Qdrant."""
    query_embedding = get_embeddings([query])[0]
    # Note: recent qdrant-client releases deprecate `search` in favor of
    # `query_points`; switch if your installed version warns about it.
    results = qdrant.search(
        collection_name=COLLECTION_NAME,
        query_vector=query_embedding,
        limit=top_k
    )
    return [
        {
            "id": result.id,
            "text": result.payload["text"],
            "doc_id": result.payload["doc_id"],
            "score": result.score
        }
        for result in results
    ]
def sparse_search(query: str, top_k: int = 10) -> List[Dict]:
    """Run BM25 sparse search over the chunk corpus."""
    query_tokens = tokenize_for_bm25(query)
    scores = bm25_index.get_scores(query_tokens)
    # Get top-k indices by score
    top_indices = np.argsort(scores)[::-1][:top_k]
    return [
        {
            "id": int(idx),
            "text": all_chunks[idx]["text"],
            "doc_id": all_chunks[idx]["doc_id"],
            "score": float(scores[idx])
        }
        for idx in top_indices
        if scores[idx] > 0  # Only return chunks with non-zero BM25 score
    ]
def hybrid_search(
    query: str,
    top_k: int = 5,
    dense_top_k: int = 10,
    sparse_top_k: int = 10,
    dense_weight: float = 1.0,
    sparse_weight: float = 1.0
) -> List[Dict]:
    """
    Full hybrid search: run both retrieval systems and fuse results.
    Retrieve more candidates than needed from each system, then fuse and truncate.
    """
    dense_results = dense_search(query, top_k=dense_top_k)
    sparse_results = sparse_search(query, top_k=sparse_top_k)
    fused = reciprocal_rank_fusion(
        dense_results,
        sparse_results,
        dense_weight=dense_weight,
        sparse_weight=sparse_weight
    )
    return fused[:top_k]
Tip: Always retrieve more candidates from each system than your final top_k. If you want 5 results, retrieve 10-15 from each system before fusing. A document that's rank 8 in dense and rank 7 in sparse is highly relevant, but you'd miss it entirely if each system only returned 5 results.
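The tip can be demonstrated with ID-only rankings. In this sketch the rankings are hypothetical: a chunk called target sits at rank 8 in the dense list and rank 7 in the sparse list, and the two systems otherwise agree on nothing. The helper rrf_fuse is a stripped-down stand-in for the fusion function above.

```python
from typing import Dict, List

def rrf_fuse(dense_ids: List[str], sparse_ids: List[str], k: int = 60) -> List[str]:
    """Fuse two ranked ID lists with unweighted RRF and return IDs best-first."""
    scores: Dict[str, float] = {}
    for ranked in (dense_ids, sparse_ids):
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)

# Hypothetical rankings: "target" is rank 8 in dense and rank 7 in sparse
dense_ranking = [f"d{i}" for i in range(1, 8)] + ["target", "d9", "d10"]
sparse_ranking = [f"s{i}" for i in range(1, 7)] + ["target", "s8", "s9", "s10"]

shallow = rrf_fuse(dense_ranking[:5], sparse_ranking[:5])    # 5 candidates per system
deep = rrf_fuse(dense_ranking[:10], sparse_ranking[:10])     # 10 candidates per system

print("target in shallow results:", "target" in shallow)
print("top result with deep candidates:", deep[0])
```

With five candidates per system, target never reaches the fusion step. With ten, it ranks first, because two mid-list positions (1/68 + 1/67) add up to more than a single first place (1/61).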
Now let's build the complete generation step. The retrieval is only half the battle — how you construct the prompt and what you do with the retrieved chunks matters enormously.
def format_context_for_prompt(results: List[Dict]) -> str:
    """
    Format retrieved chunks as structured context for the LLM.
    Including the source document helps the model cite correctly.
    """
    context_parts = []
    for i, result in enumerate(results):
        source = result.get("doc_id", "unknown").replace("_", " ").title()
        context_parts.append(
            f"[Source {i+1}: {source}]\n{result['text']}"
        )
    return "\n\n---\n\n".join(context_parts)
def rag_query(
    question: str,
    top_k: int = 5,
    dense_weight: float = 1.0,
    sparse_weight: float = 1.0,
    model: str = "gpt-4o-mini"
) -> Dict:
    """
    Complete RAG pipeline with hybrid retrieval.
    Returns answer plus metadata about which sources were used.
    """
    # Step 1: Hybrid retrieval
    retrieved = hybrid_search(
        query=question,
        top_k=top_k,
        dense_weight=dense_weight,
        sparse_weight=sparse_weight
    )
    if not retrieved:
        # Return the same keys as the success path so callers can rely on them
        return {"question": question, "answer": "No relevant content found.", "sources": []}
    # Step 2: Format context
    context = format_context_for_prompt(retrieved)
    # Step 3: Construct prompt
    system_prompt = """You are a contract analysis assistant. Answer questions based
strictly on the contract excerpts provided. If the information isn't in the provided
excerpts, say so explicitly rather than inferring or guessing. When citing specific
clauses or numbers, be precise."""
    user_prompt = f"""Based on the following contract excerpts, answer this question:
Question: {question}
Contract Excerpts:
{context}
Provide a clear, precise answer. If you reference specific numbers, terms, or clause
numbers, quote them directly from the source text."""
    # Step 4: Generate response
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt}
        ],
        temperature=0.1  # Low temperature for factual contract queries
    )
    answer = response.choices[0].message.content
    return {
        "question": question,
        "answer": answer,
        "sources": [
            {
                "doc_id": r["doc_id"],
                "text_preview": r["text"][:120] + "...",
                "rrf_score": round(r["rrf_score"], 4),
                "dense_rank": r.get("dense_rank"),
                "sparse_rank": r.get("sparse_rank")
            }
            for r in retrieved
        ]
    }
Let's run some queries that deliberately stress-test each retrieval type, so you can see the hybrid advantage in action.
def compare_retrieval_methods(query: str, top_k: int = 3):
    """Compare dense-only, sparse-only, and hybrid retrieval for a query."""
    print(f"\n{'='*70}")
    print(f"QUERY: {query}")
    print('='*70)
    dense_only = dense_search(query, top_k=top_k)
    sparse_only = sparse_search(query, top_k=top_k)
    hybrid_only = hybrid_search(query, top_k=top_k)
    print("\n--- DENSE ONLY (semantic) ---")
    for i, r in enumerate(dense_only):
        print(f" {i+1}. [{r['doc_id']}] Score: {r['score']:.4f}")
        print(f" {r['text'][:100]}...")
    print("\n--- SPARSE ONLY (BM25) ---")
    for i, r in enumerate(sparse_only):
        print(f" {i+1}. [{r['doc_id']}] BM25: {r['score']:.4f}")
        print(f" {r['text'][:100]}...")
    print("\n--- HYBRID (RRF fused) ---")
    for i, r in enumerate(hybrid_only):
        # The rank keys always exist but hold None when a system did not return
        # the chunk, so a .get() default would never apply; use `or` instead
        d_rank = r.get('dense_rank') or 'N/A'
        s_rank = r.get('sparse_rank') or 'N/A'
        print(f" {i+1}. [{r['doc_id']}] RRF: {r['rrf_score']:.4f} "
              f"(dense rank: {d_rank}, sparse rank: {s_rank})")
        print(f" {r['text'][:100]}...")
# Test with a precision query — dense will struggle here
compare_retrieval_methods("section 14.3(b) data processing requirements")
# Test with a semantic query — sparse will struggle here
compare_retrieval_methods("what happens if a vendor fails to meet uptime commitments")
# Test with a mixed query — both systems contribute
compare_retrieval_methods("termination fee obligations and payment terms")
# Full RAG test
result = rag_query("What does clause 14.3(b) require regarding data handling?")
print(f"\n{'='*70}")
print(f"RAG ANSWER:\n{result['answer']}")
print("\nSOURCES USED:")
for s in result['sources']:
    print(f" - {s['doc_id']} (RRF: {s['rrf_score']}, "
          f"dense: {s['dense_rank']}, sparse: {s['sparse_rank']})")
One of the most common questions about hybrid search is: should I weight dense and sparse retrieval equally, or should I favor one? The answer depends entirely on your corpus characteristics and query distribution.
Lean toward higher sparse weight when:
Lean toward higher dense weight when:
Here's a simple evaluation harness you can use to find optimal weights for your corpus:
def evaluate_weights(
    test_queries: List[Dict],
    weight_combinations: List[Tuple[float, float]],
    top_k: int = 5
) -> Dict:
    """
    Evaluate different dense/sparse weight combinations against labeled test queries.
    test_queries format: [{"query": str, "relevant_doc_ids": List[str]}]
    Returns hit rates for each weight combination.
    """
    results = {}
    # Computed once, outside the loop, so the report below works even when
    # weight_combinations is empty
    total = len(test_queries)
    for dense_w, sparse_w in weight_combinations:
        hits = 0
        for test_case in test_queries:
            retrieved = hybrid_search(
                query=test_case["query"],
                top_k=top_k,
                dense_weight=dense_w,
                sparse_weight=sparse_w
            )
            retrieved_doc_ids = {r["doc_id"] for r in retrieved}
            relevant_found = any(
                doc_id in retrieved_doc_ids
                for doc_id in test_case["relevant_doc_ids"]
            )
            if relevant_found:
                hits += 1
        key = f"dense={dense_w}, sparse={sparse_w}"
        results[key] = {
            "hit_rate": hits / total if total else 0.0,
            "hits": hits,
            "total": total
        }
    # Sort by hit rate
    sorted_results = sorted(results.items(), key=lambda x: x[1]["hit_rate"], reverse=True)
    print("Weight Evaluation Results (by Hit Rate):")
    print(f"{'Configuration':<30} {'Hit Rate':>10} {'Hits':>6}/{total}")
    print("-" * 55)
    for config, metrics in sorted_results:
        print(f"{config:<30} {metrics['hit_rate']:>9.1%} {metrics['hits']:>6}/{total}")
    return dict(sorted_results)
# Example labeled test set — you'd build this from real user queries
test_queries = [
    {
        "query": "section 14.3(b) personal data obligations",
        "relevant_doc_ids": ["vendor_acme_2024"]
    },
    {
        "query": "what are the uptime guarantees and compensation for failures",
        "relevant_doc_ids": ["vendor_acme_2024", "vendor_globalsys_2024"]
    },
    {
        "query": "early termination fee calculation",
        "relevant_doc_ids": ["vendor_acme_2024"]
    },
    {
        "query": "intellectual property ownership of custom development",
        "relevant_doc_ids": ["vendor_globalsys_2024"]
    },
    {
        "query": "arbitration clause and governing jurisdiction",
        "relevant_doc_ids": ["vendor_globalsys_2024"]
    }
]
weight_combinations = [
    (1.0, 1.0),  # Equal weight
    (1.5, 1.0),  # Favor dense
    (1.0, 1.5),  # Favor sparse
    (2.0, 1.0),  # Heavy dense
    (1.0, 2.0),  # Heavy sparse
    (0.7, 1.3),  # Moderate sparse boost
]
evaluate_weights(test_queries, weight_combinations)
Tip: Build your test set from actual user queries whenever possible. Synthetic test queries written by the developer who built the system will always be biased toward the vocabulary and phrasing used in the documents. Real user queries are messier and more representative.
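Document-level hit rate is a coarse signal, especially on a small corpus where most queries will hit the right contract by chance. If your test set labels relevant chunks rather than documents, chunk-level metrics are more informative. The sketch below computes recall@k and mean reciprocal rank; the function names and the labeled cases are hypothetical.

```python
from typing import List, Set

def recall_at_k(retrieved: List[int], relevant: Set[int], k: int) -> float:
    """Fraction of the relevant chunk IDs that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)

def reciprocal_rank(retrieved: List[int], relevant: Set[int]) -> float:
    """1 / position of the first relevant chunk, or 0.0 if none was retrieved."""
    for position, chunk_id in enumerate(retrieved, start=1):
        if chunk_id in relevant:
            return 1.0 / position
    return 0.0

# Hypothetical labeled cases: retrieved chunk IDs (best first) and the relevant set
cases = [
    {"retrieved": [4, 2, 9, 1, 7], "relevant": {2, 7}},
    {"retrieved": [3, 8, 5, 0, 6], "relevant": {6}},
    {"retrieved": [1, 2, 3, 4, 5], "relevant": {9}},
]

mean_recall_at_3 = sum(
    recall_at_k(c["retrieved"], c["relevant"], k=3) for c in cases
) / len(cases)
mrr = sum(reciprocal_rank(c["retrieved"], c["relevant"]) for c in cases) / len(cases)
print(f"recall@3: {mean_recall_at_3:.3f}, MRR: {mrr:.3f}")
```

Recall@k tells you whether the right chunks made it into the context window at all, while MRR rewards placing the first relevant chunk near the top. Tracking both makes differences between weight settings visible even when hit rate is saturated.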
Build a hybrid RAG system for a domain of your choice. Here are the specific requirements:
Part 1 — Corpus Setup (30 minutes)
Choose one of these realistic corpora:
Chunk them with at least 3 different chunking strategies (different chunk sizes, with and without overlap) and observe how chunking affects retrieval quality.
Part 2 — Retrieval Comparison (30 minutes)
Write 10 test queries for your corpus: 5 that you expect to favor dense retrieval (conceptual, paraphrase-heavy) and 5 that should favor sparse retrieval (exact terms, model numbers, specific names). Run all three methods (dense-only, sparse-only, hybrid) and document where each succeeds or fails.
Part 3 — Weight Tuning (20 minutes)
Using your 10 labeled queries as a test set, run the evaluate_weights function across at least 6 weight combinations. What configuration works best for your corpus? Does the result surprise you?
Part 4 — End-to-End RAG (20 minutes)
Implement the full rag_query function against your corpus. Ask 3 questions:
Critically evaluate each answer: Is it correct? What would have gone wrong if you'd used only dense or only sparse retrieval?
Mistake 1: Not retrieving enough candidates before fusion
If you only retrieve top_k=5 from each system before fusing, you create a blind spot: a document ranked 6-10 in both systems never reaches RRF, even though its combined score would beat a document that makes the top 5 in only one system. Always retrieve 2x-3x your target top_k from each system before fusion.
Mistake 2: Aggressive BM25 tokenization that strips numbers
The whole point of sparse retrieval is precision on specific terms. If your tokenizer strips numbers, you lose all precision on queries like "clause 14.3" or "99.9% uptime" or "net-30 payment." Keep alphanumeric tokens intact.
Mistake 3: Mismatched preprocessing pipelines for indexing and querying
BM25 matches tokens exactly, so whatever preprocessing you applied at index time must be applied identically at query time. If you lowercased during indexing, you must lowercase queries. This sounds obvious but is a common source of silent bugs where BM25 suddenly returns nothing useful.
Mistake 4: Using RRF k=60 as sacred
The k=60 constant comes from the original RRF paper and is a sensible default, not a law. For small corpora where the top-ranked documents are usually the right ones, smaller k values (like 20-30) amplify the advantage of high ranks. For noisy corpora, larger k values flatten the differences between ranks so that no single system's top hit dominates. Treat k as a hyperparameter and tune it against your test set.
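A quick sketch shows how k can flip the ordering of two documents. The rank pairs are invented: one document is rank 1 in one system but rank 50 in the other, and a second is rank 10 in both.

```python
from typing import Iterable

def rrf_score(ranks: Iterable[int], k: int) -> float:
    """Unweighted RRF score for a document with the given 1-based ranks."""
    return sum(1.0 / (k + rank) for rank in ranks)

spiky = (1, 50)        # top of one list, deep in the other
consistent = (10, 10)  # middling but steady in both lists

winners = {}
for k in (5, 20, 60):
    spiky_wins = rrf_score(spiky, k) > rrf_score(consistent, k)
    winners[k] = "spiky" if spiky_wins else "consistent"
    print(f"k={k}: {winners[k]} document ranks higher")
```

With k=5 the single first-place finish dominates. At k=20 and k=60 the consistent document wins. Neither outcome is correct in general, which is why k is worth tuning against labeled queries.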
Mistake 5: Not handling the "sparse returns zero results" case
If a query contains no tokens that appear in any document (this happens with purely conceptual queries or heavy use of stopwords), BM25 returns all-zero scores. The sparse_search function above handles this by filtering out zero-score results, but you should be explicit about this logic and decide whether to fall back to dense-only in that case rather than fusing an empty sparse list.
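One way to make that decision explicit is sketched below. The retrievers are passed in as plain callables so the example runs without Qdrant or an API key; retrieve_with_fallback, the stub retrievers, and the mode field are hypothetical names introduced for illustration.

```python
from typing import Callable, Dict, List

Retriever = Callable[[str], List[str]]

def rrf_fuse(dense_ids: List[str], sparse_ids: List[str], k: int = 60) -> List[str]:
    """Unweighted RRF over two ranked ID lists, best first."""
    scores: Dict[str, float] = {}
    for ranked in (dense_ids, sparse_ids):
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)

def retrieve_with_fallback(query: str, dense: Retriever, sparse: Retriever) -> Dict:
    """Fuse when both systems return results; otherwise record the fallback."""
    dense_ids = dense(query)
    sparse_ids = sparse(query)
    if not sparse_ids:
        # BM25 matched nothing: say so explicitly instead of fusing an empty list
        return {"mode": "dense_only", "ids": dense_ids}
    return {"mode": "hybrid", "ids": rrf_fuse(dense_ids, sparse_ids)}

# Stub retrievers standing in for dense_search and sparse_search
def fake_dense(query: str) -> List[str]:
    return ["c3", "c1", "c7"]

def fake_sparse(query: str) -> List[str]:
    return ["c1", "c9"] if "indemnification" in query.lower() else []

conceptual = retrieve_with_fallback("who bears the risk", fake_dense, fake_sparse)
lexical = retrieve_with_fallback("indemnification cap", fake_dense, fake_sparse)
print(conceptual["mode"], lexical["mode"])
```

Recording the mode alongside the results means a later log review can tell apart answers that came from fused retrieval and answers that relied on dense search alone.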
Mistake 6: Treating chunk boundaries as arbitrary
Documents have structure — sections, paragraphs, clauses. Naive fixed-size chunking cuts across these boundaries and creates chunks that start or end mid-sentence. For legal documents especially, a clause split across two chunks means no single chunk contains the complete clause, and retrieval will always be partial. Consider structural chunking: split on section boundaries when possible, using regex or document parsing to find natural break points.
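A minimal sketch of structural chunking with a regex follows. It assumes that every section heading starts a new line and begins with "Section" or "Article" followed by a number, which matches the sample contracts in this lesson but will not hold for every document. The function name and sample text are invented for illustration.

```python
import re
from typing import List

# Zero-width match at the start of any line that opens a section heading.
# Assumption: headings begin a line with "Section" or "Article" plus a number.
SECTION_START = re.compile(r"^(?=[ \t]*(?:Section|Article)\s+\d)", re.MULTILINE)

def split_on_sections(text: str) -> List[str]:
    """Split text so each chunk holds one complete section."""
    parts = SECTION_START.split(text)
    # Collapse internal whitespace and drop empty fragments
    return [" ".join(part.split()) for part in parts if part.strip()]

sample = """Master Service Agreement between two parties.
Section 7. Indemnification. Each party shall indemnify the other.
Section 9.1. Limitation of Liability. The cap excludes
obligations under Section 7.
Section 14.3(b). Data Processing. Vendor shall process personal data
only on documented instructions."""

chunks = split_on_sections(sample)
for chunk in chunks:
    print(chunk)
```

Anchoring the pattern to the start of a line matters: the cross-reference "under Section 7." inside Section 9.1 does not trigger a split, so the clause stays whole. Sections that exceed your size budget can still be passed through a fixed-size chunker afterwards.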
Debugging tip: Log retrieval provenance
In production, always log which system contributed to each final result (dense rank, sparse rank, final RRF score). When users report bad answers, this log tells you immediately whether it's a retrieval failure (the right chunk wasn't retrieved) or a generation failure (the right chunk was there but the LLM ignored or misread it). These require completely different fixes.
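A provenance log can be as simple as one JSON line per query. The sketch below assumes result dicts shaped like the output of the fusion step (id, rrf_score, dense_rank, sparse_rank); the logger name and function names are hypothetical.

```python
import json
import logging
from typing import Dict, List

logger = logging.getLogger("hybrid_rag.retrieval")  # hypothetical logger name

def provenance_record(query: str, results: List[Dict]) -> str:
    """Serialize which system contributed to each fused result as one JSON line."""
    record = {
        "query": query,
        "results": [
            {
                "id": r["id"],
                "rrf_score": round(r["rrf_score"], 4),
                "dense_rank": r.get("dense_rank"),    # None if dense missed it
                "sparse_rank": r.get("sparse_rank"),  # None if BM25 missed it
            }
            for r in results
        ],
    }
    return json.dumps(record)

def log_retrieval(query: str, results: List[Dict]) -> str:
    """Log the provenance line and return it for convenience."""
    line = provenance_record(query, results)
    logger.info(line)
    return line

# Example fused results (values invented for illustration)
example_results = [
    {"id": 6, "rrf_score": 0.03279, "dense_rank": 1, "sparse_rank": 1},
    {"id": 2, "rrf_score": 0.01613, "dense_rank": 2, "sparse_rank": None},
]
line = log_retrieval("What does clause 14.3(b) require?", example_results)
print(line)
```

A null sparse_rank next to a bad answer points toward tokenization or vocabulary mismatch, while a correct chunk at rank 1 in both systems points toward the generation step instead.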
You've built a complete hybrid RAG pipeline that combines dense vector search with BM25 sparse retrieval through Reciprocal Rank Fusion. You understand not just the mechanics but the why: dense retrieval handles semantic generalization, sparse retrieval handles lexical precision, and RRF fuses their ranked outputs in a way that rewards consistent relevance without requiring score normalization.
The key architectural decisions we made:
Retrieving more candidates than the final top_k from each system
Where to take this next:
Contextual compression — Before sending retrieved chunks to the LLM, run each one through a compression step (another LLM call or extractive summarization) to extract only the relevant sentences. This reduces token usage and noise in the context window.
Query rewriting and decomposition — Complex multi-part questions often require separate retrieval runs for each sub-question. A query like "compare the indemnification provisions across both contracts" should be decomposed into separate retrievals before synthesis.
Re-ranking with a cross-encoder — After hybrid retrieval, a cross-encoder model (like Cohere Rerank or a local cross-encoder/ms-marco-MiniLM-L-6-v2) performs full attention over query + document pairs, giving a much more precise relevance score. This is the "retrieve then re-rank" pattern that powers production search systems at scale.
Metadata filtering — Combine hybrid search with metadata pre-filtering (by date, document type, department, or access permissions) to scope retrieval before ranking begins. Qdrant supports this natively with filtered search.
Evaluation frameworks — Move from ad-hoc testing to systematic evaluation using frameworks like RAGAS or TruLens, which measure faithfulness, answer relevance, and context recall as structured metrics you can track over time.
The hybrid search pattern is currently one of the highest-value improvements you can make to a RAG system at the practitioner level. Dense-only systems are leaving precision on the table; keyword-only systems are leaving semantic understanding on the table. Combining them captures both, and in production systems over real enterprise document corpora, the improvement in retrieval quality is consistently measurable.