Reciprocal Rank Fusion for Hybrid Search
A comprehensive guide to using Reciprocal Rank Fusion (RRF) to combine BM25 and semantic search results for production RAG systems.
Who This Guide Is For
This guide is designed for engineers building search and retrieval systems who want to understand how to combine keyword and semantic search effectively. It assumes you're already familiar with:
- BM25 or keyword search — how lexical matching works and when it's useful
- Embeddings and semantic search — using vector similarity for retrieval
- RAG (Retrieval-Augmented Generation) basics — the general pattern of retrieving context for LLMs
New to these concepts? That's completely fine! If you'd like to start with the fundamentals, we recommend reading our BM25 Search in LLM Systems guide first, which covers keyword search from the ground up. For embeddings and RAG, there are excellent introductory resources from OpenAI, Pinecone, and LangChain that explain these concepts in depth.
This guide focuses specifically on Reciprocal Rank Fusion (RRF) — a practical technique for combining multiple retrieval methods. If you're already working with both BM25 and semantic search and wondering how to merge their results, you're in the right place.
Hybrid search combines BM25 (keyword matching) and semantic search (embedding similarity) for better results. But combining them is tricky — their scores are on different scales.
Reciprocal Rank Fusion (RRF) solves this by combining rankings instead of scores[^1]. No normalization needed, minimal tuning required. RRF has become widely adopted in production search systems due to its simplicity and effectiveness compared to score-based fusion methods.
BM25 vs. Semantic Search: When to Go Hybrid?
Before diving into RRF, it's worth understanding why you'd combine BM25 and semantic search in the first place. Each has distinct strengths:
BM25 excels at:
- Exact keyword matching (product names, error codes, technical terms)
- Queries where users know the precise terminology
- Speed and simplicity — runs on CPU, no ML models required
- Explainability — you can see exactly which terms matched
Semantic search excels at:
- Conceptual matching (finding "authentication" when user searches "login")
- Natural language queries and questions
- Handling synonyms and paraphrasing
- Discovering related content the user didn't know to search for
When hybrid makes sense: Most production RAG systems benefit from hybrid search because real queries mix both patterns. A query like "how to authenticate API requests" has strong keywords ("authenticate", "API") and semantic intent (the user wants authentication methods, not just documents containing those words).
If your queries are purely keyword-based (e.g., searching product SKUs), BM25 alone may suffice. If your queries are purely conceptual with varied phrasing, semantic search alone may work. But for most use cases — especially developer documentation, support systems, and knowledge bases — hybrid search consistently outperforms either method alone[^2].
What is Reciprocal Rank Fusion?
RRF combines multiple ranked lists by working with rank positions instead of scores. Documents that rank high in multiple lists get boosted.
How RRF Works
RRF assigns scores based on rank position: each document receives 1 / (k + rank) from every list it appears in. Rank 1 gets the highest score, rank 2 gets slightly less, and so on. Documents appearing in multiple lists get their scores summed.
The parameter k (default 60) controls how much top ranks dominate[^1]. Lower k = more weight to top positions. Higher k = more gradual decay.
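Written out (this is the standard RRF formulation from Cormack et al.[^1], where R is the set of ranked lists being fused and rank_r(d) is the position of document d in list r):

$$\mathrm{RRF}(d) = \sum_{r \in R} \frac{1}{k + \mathrm{rank}_r(d)}$$

A document contributes nothing from lists that don't contain it.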
Simple Example: Imagine we have just two documents and two ranked lists:
- BM25 ranking: [Doc A, Doc B]
- Semantic ranking: [Doc B, Doc A]
Using k = 60:
- Doc A: Ranks #1 in BM25, #2 in semantic → RRF score = 1/(60+1) + 1/(60+2) = 1/61 + 1/62 ≈ 0.0325
- Doc B: Ranks #2 in BM25, #1 in semantic → RRF score = 1/(60+2) + 1/(60+1) = 1/62 + 1/61 ≈ 0.0325
Both documents get equal combined scores because they have symmetric rankings—each is #1 in one retriever and #2 in the other. This demonstrates how RRF treats both retrieval methods equally.
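To make the arithmetic concrete, here's a minimal Python check of the two-document example (nothing assumed beyond k = 60):

```python
# Two ranked lists: BM25 puts Doc A first, semantic search puts Doc B first.
bm25_ranking = ["doc_a", "doc_b"]
semantic_ranking = ["doc_b", "doc_a"]
k = 60

scores: dict[str, float] = {}
for ranking in (bm25_ranking, semantic_ranking):
    for rank, doc_id in enumerate(ranking, start=1):  # ranks are 1-based
        scores[doc_id] = scores.get(doc_id, 0.0) + 1 / (k + rank)

print(scores)
# {'doc_a': 0.0325..., 'doc_b': 0.0325...}  (identical, as expected)
```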
Score Decay by Rank: Here's how RRF scores decay across rank positions with the default k = 60:
Lower k values (20-40) favor top ranks more aggressively, while higher values (80-100) create a more gradual decay. The default k = 60 balances these trade-offs.
| Rank | RRF Score (k=60) |
|---|---|
| 1 | 0.0164 |
| 2 | 0.0161 |
| 3 | 0.0159 |
| 5 | 0.0154 |
| 10 | 0.0143 |
| 20 | 0.0125 |
| 50 | 0.0091 |
| 100 | 0.0063 |
Notice how the score decays fastest for the top ranks, then levels off. This means documents ranked in the top 10-20 positions contribute significantly more than documents ranked lower. Smaller k values sharpen this decay; larger values flatten it.
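If you want to reproduce the table or inspect other k values, a few lines of Python are enough (the rank positions below simply mirror the table):

```python
# Print the RRF contribution 1 / (k + rank) for several k values.
ranks = [1, 2, 3, 5, 10, 20, 50, 100]
for k in (20, 60, 100):
    row = ", ".join(f"rank {r}: {1 / (k + r):.4f}" for r in ranks)
    print(f"k={k} -> {row}")

# For k=60 this matches the table above
# (0.0164 at rank 1, 0.0143 at rank 10, ~0.0063 at rank 100).
```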
Why RRF Works Well
RRF solves the main problems with combining BM25 and semantic search:
Different score scales: BM25 scores (unbounded, often landing in the 0-20 range) and semantic similarity scores (typically 0-1) aren't comparable. RRF uses ranks instead.
No normalization needed: Ranks are always comparable. Rank 1 means "best match" in both systems.
Handles missing results: If BM25 finds a document embeddings miss (or vice versa), RRF still works. Documents only contribute from lists where they appear.
Minimal tuning: Works well out of the box with k = 60. While k can be adjusted, RRF is generally robust to its value[^1]. Weighted fusion requires tuning a weight parameter, which needs validation data and iteration.
A Synthetic Example: RRF in Action
Let's see how RRF combines BM25 and semantic search results with a concrete example.
Query: "How do I authenticate API requests?"
Documents:
- Doc A: "Authentication with API keys is the simplest method..."
- Doc B: "OAuth 2.0 provides secure authentication for production..."
- Doc C: "Rate limiting protects the API from abuse..."
- Doc D: "Error handling for API requests includes 401 authentication failures..."
- Doc E: "Query optimization improves retrieval performance..."
Here's how BM25 and semantic search might rank these documents, and how RRF combines them:
| Document | BM25 Score | BM25 Rank | Semantic Score | Semantic Rank | RRF Score (k=60) | Final Rank |
|---|---|---|---|---|---|---|
| Doc A | 8.5 | 1 | 0.72 | 3 | 1/61 + 1/63 = 0.0323 | 1 |
| Doc B | 4.1 | 3 | 0.88 | 1 | 1/63 + 1/61 = 0.0323 | 1 (tie) |
| Doc C | 6.2 | 2 | 0.31 | 5 | 1/62 + 1/65 = 0.0315 | 3 |
| Doc D | 3.8 | 4 | 0.78 | 2 | 1/64 + 1/62 = 0.0318 | 2 |
| Doc E | 0.0 | — | 0.45 | 4 | 0 + 1/64 = 0.0156 | 4 |
Understanding the Rankings:
- Doc A ranks #1 in BM25 because it contains the exact phrase "authentication with API keys" — strong keyword match. It ranks #3 in semantic search because the meaning is relevant but slightly less central than OAuth content.
- Doc B ranks #1 in semantic search because "OAuth 2.0 provides secure authentication" captures the semantic intent of the query better, even though it doesn't use "API" as prominently. It ranks #3 in BM25 due to fewer keyword matches.
- Doc A and Doc B have identical RRF scores (0.0323) because they have symmetric ranks: one is (1, 3) and the other is (3, 1). This is intentional — RRF treats both retrieval methods equally. Ties like this require a secondary tie-breaker in your implementation (e.g., original score, recency, or random selection).
- Doc D outranks Doc C despite Doc C ranking higher in BM25, because Doc D's strong semantic rank (#2) gives it a higher combined score.
- Doc E only appears in the semantic results (its BM25 score is 0, so it gets no BM25 contribution). It still receives an RRF score from its semantic rank, but ranks lowest overall.
Notice how RRF naturally balances the two retrieval methods. Documents that rank well in both lists get boosted, while documents that only rank well in one list still contribute but don't dominate.
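A quick sanity check of the table's arithmetic in plain Python, with the ranks taken straight from the table:

```python
k = 60

# (bm25_rank, semantic_rank) per document; None = not retrieved by that method
ranks = {
    "Doc A": (1, 3),
    "Doc B": (3, 1),
    "Doc C": (2, 5),
    "Doc D": (4, 2),
    "Doc E": (None, 4),
}

rrf = {
    doc: sum(1 / (k + r) for r in doc_ranks if r is not None)
    for doc, doc_ranks in ranks.items()
}
for doc, score in sorted(rrf.items(), key=lambda item: item[1], reverse=True):
    print(f"{doc}: {score:.4f}")
# Doc A: 0.0323, Doc B: 0.0323, Doc D: 0.0318, Doc C: 0.0315, Doc E: 0.0156
```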
Why Weighted Score Fusion Falls Short
Weighted fusion normalizes scores then combines them with a weighted average. But this has problems:
Score distributions differ: BM25 scores are skewed (most low, few high), while semantic scores cluster around 0.6-0.8. Normalization loses this information.
Normalization breaks: Score ranges change as your corpus grows or you switch embedding models. You'd need constant re-normalization.
Hard to tune: Finding the right weight requires validation data, grid search, and re-tuning as queries change. And one weight doesn't fit all queries.
Edge cases fail: What if one retriever returns no results? Or far fewer? Normalization breaks down.
RRF avoids all this by using ranks instead of scores.
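For contrast, here's a minimal sketch of min-max weighted score fusion, the approach RRF replaces. The weight alpha and the normalization step are exactly the pieces that need tuning and that drift as score distributions change (function and parameter names here are illustrative, not from any particular library):

```python
def weighted_score_fusion(
    bm25_scores: dict[str, float],
    semantic_scores: dict[str, float],
    alpha: float = 0.5,  # blend weight: needs validation data to tune
) -> list[tuple[str, float]]:
    """Min-max normalize each score set, then blend with a weighted average."""

    def normalize(scores: dict[str, float]) -> dict[str, float]:
        if not scores:
            return {}
        lo, hi = min(scores.values()), max(scores.values())
        if hi == lo:  # degenerate case: one result, or all scores identical
            return {doc: 1.0 for doc in scores}
        return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}

    bm25_norm = normalize(bm25_scores)
    sem_norm = normalize(semantic_scores)
    docs = set(bm25_norm) | set(sem_norm)
    fused = {
        doc: alpha * bm25_norm.get(doc, 0.0) + (1 - alpha) * sem_norm.get(doc, 0.0)
        for doc in docs
    }
    return sorted(fused.items(), key=lambda x: x[1], reverse=True)
```

Every failure mode above lives in this sketch: the min-max bounds move as the corpus or embedding model changes, alpha has to be re-tuned, and a retriever that returns nothing (or a single result) makes the normalization degenerate.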
Why Rerankers Are Complex
Rerankers can achieve 10-20% higher accuracy than RRF in some benchmarks[^3][^4], but they're much more complex:
Slow: 10-50ms per document means 500ms-2.5 seconds for top-50 candidates. RRF takes <1ms.
Need fine-tuning: Pre-trained models work okay, but you need to fine-tune on your domain, task, and metrics. This requires labeled data, GPUs, and ongoing re-training.
Expensive: GPU inference, infrastructure, and maintenance costs add up. The accuracy gain often isn't worth 100× the cost.
Black boxes: Can't explain why a document ranked high. RRF is interpretable — you can see it ranked #2 in BM25 and #3 in semantic search.
Rerankers make sense when latency isn't critical, accuracy is paramount, and you have labeled data and GPU resources. For most systems, RRF is the sweet spot: fast, simple, and accurate enough.
Implementing RRF
RRF is simple: assign scores based on rank (1 / (k + rank)), sum scores for documents in multiple lists, then sort.
```python
from collections import defaultdict


def reciprocal_rank_fusion(
    ranked_lists: list[list[str]],
    k: int = 60
) -> list[tuple[str, float]]:
    """
    Combine multiple ranked lists using Reciprocal Rank Fusion.

    Args:
        ranked_lists: List of ranked document ID lists. Each inner list
            is ordered by relevance (index 0 = rank 1).
        k: RRF parameter controlling score decay. Default 60.

    Returns:
        List of (doc_id, rrf_score) tuples sorted by score descending.
    """
    rrf_scores = defaultdict(float)

    for ranked_list in ranked_lists:
        # start=1 because ranks start at 1, not 0 (first element is rank 1)
        for rank, doc_id in enumerate(ranked_list, start=1):
            rrf_scores[doc_id] += 1 / (k + rank)

    # Sort by score descending
    return sorted(rrf_scores.items(), key=lambda x: x[1], reverse=True)


# Example usage
bm25_results = ["doc_a", "doc_c", "doc_b", "doc_d"]
semantic_results = ["doc_b", "doc_d", "doc_a", "doc_e"]

fused = reciprocal_rank_fusion([bm25_results, semantic_results], k=60)
# Returns: [('doc_a', 0.0323), ('doc_b', 0.0323), ('doc_d', 0.0318), ...]
```
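Wiring this into a hybrid pipeline is a thin wrapper around whatever retrievers you already have. A sketch, assuming you expose two functions bm25_search(query, top_k) and vector_search(query, top_k) that each return document IDs in ranked order (both are placeholders, not a specific library's API):

```python
def hybrid_search(query: str, top_k: int = 10, k: int = 60) -> list[tuple[str, float]]:
    # Pull more candidates than you ultimately need from each retriever,
    # so documents found by only one method still have a chance to surface.
    bm25_ids = bm25_search(query, top_k=50)      # placeholder keyword retriever
    vector_ids = vector_search(query, top_k=50)  # placeholder embedding retriever

    fused = reciprocal_rank_fusion([bm25_ids, vector_ids], k=k)
    return fused[:top_k]
```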
Tuning RRF
The default k = 60 works well for most use cases and is widely adopted across production systems. The k parameter controls how aggressively the RRF score decays as rank position increases.
Understanding the k Parameter
Lower k (20-40): Concentrates weight on top-ranked documents. Use when your retrievers are highly accurate and you want to strongly favor documents that rank well in both systems.
Higher k (80-100): Creates more gradual decay, giving mid-ranked documents more influence. Use when your retrievers are noisy or when you want to cast a wider net in the initial fusion.
Default k = 60: Balances these trade-offs. It's aggressive enough to prioritize top matches but forgiving enough to surface good documents that rank well in only one retriever.
When to Tune k
While k = 60 is a robust default, consider tuning k in these scenarios:
High-precision systems: If your use case prioritizes precision over recall (e.g., legal or medical retrieval where false positives are costly), try lower k values (30-50) to more aggressively weight documents that rank high in both retrievers.
Diverse retrieval quality: If one retriever is significantly weaker than the other, a higher k (70-90) can help surface good documents that only appear in the stronger retriever's results.
Small candidate sets: If you're only retrieving top-5 or top-10 results from each retriever, lower k values (40-50) work better since all documents are already high-ranked.
Large candidate sets: If you're fusing top-100 results from each retriever, higher k values (70-80) prevent over-concentration on the very top ranks.
How to Tune k
The best approach is to evaluate on a held-out test set with your specific retrievers and queries:
- Start with k = 60 as your baseline
- Try a sweep of other k values (e.g., 20, 40, 80, 100) on your validation set
- Measure your target metric (e.g., NDCG@10, MRR, or precision@5)
- Select the k that performs best, but only switch from 60 if the gain is meaningful (e.g., >3-5% relative improvement)
In practice, tuning k often yields modest gains, and the default of 60 is sufficient for most systems. However, tuning can be worthwhile if you have evaluation data and your use case has specific precision/recall requirements.
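Here's a minimal sketch of that procedure, assuming the reciprocal_rank_fusion function from above plus a small validation set of queries with known relevant document IDs (the data structures are illustrative, not a specific framework's format):

```python
def mrr_at_10(ranked_ids: list[str], relevant_ids: set[str]) -> float:
    """Reciprocal rank of the first relevant document within the top 10."""
    for rank, doc_id in enumerate(ranked_ids[:10], start=1):
        if doc_id in relevant_ids:
            return 1 / rank
    return 0.0


def sweep_k(validation_set, candidate_ks=(20, 40, 60, 80, 100)) -> dict[int, float]:
    """validation_set: list of (bm25_ids, semantic_ids, relevant_ids) tuples, one per query."""
    results = {}
    for k in candidate_ks:
        per_query = []
        for bm25_ids, semantic_ids, relevant_ids in validation_set:
            fused = reciprocal_rank_fusion([bm25_ids, semantic_ids], k=k)
            fused_ids = [doc_id for doc_id, _ in fused]
            per_query.append(mrr_at_10(fused_ids, relevant_ids))
        results[k] = sum(per_query) / len(per_query)  # mean MRR@10 for this k
    return results
```

Pick the k with the best mean metric, and keep 60 unless the improvement clears the threshold above.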
Choosing Your Approach
Here's how the main retrieval strategies compare on practical dimensions:
| Aspect | BM25 Only | Semantic Only | Hybrid + RRF | Hybrid + Reranker |
|---|---|---|---|---|
| Setup | Minutes | Hours (embeddings) | Hours | Days (fine-tuning) |
| Infrastructure | CPU | GPU or API | CPU + GPU/API | GPU required |
| Latency | ~1-5ms | ~10-50ms | ~15-60ms | ~100-500ms |
| Cost | Low | Medium | Medium | High |
| Tuning needed | None | Embedding choice | Minimal (k) | Extensive |
| Accuracy | Good for keywords | Good for concepts | Better overall | Best (10-20% gain) |
| Explainability | High | Low | Medium | Low |
| Best for | Exact matches | Semantic queries | General purpose | High-stakes apps |
Recommendation: Start with Hybrid + RRF. It covers the widest range of query types with minimal setup. Only move to rerankers if you have labeled data, GPU resources, and accuracy requirements that justify the added complexity.
Summary
Reciprocal Rank Fusion is the default choice for hybrid search because it:
- Works out of the box — no normalization, minimal tuning (k = 60 is a robust default)
- Handles scale differences — BM25 and semantic scores don't need to be comparable
- Is fast — <1ms fusion time vs 500ms+ for reranking
- Is interpretable — you can see exactly which ranks contributed to the final score
- Handles edge cases — missing results, different list lengths, partial matches
Weighted score fusion requires careful normalization and tuning, while rerankers require fine-tuning, labeled data, and GPU infrastructure. For most production RAG systems, RRF hits the sweet spot: simple, fast, and accurate enough.
Start with hybrid search using RRF (k = 60). Only move to rerankers if you have labeled evaluation data, GPU resources, and accuracy requirements that justify the added complexity — and even then, consider using RRF as a first-stage filter before reranking.
References
Footnotes
[^1]: Cormack, G. V., Clarke, C. L. A., & Büttcher, S. (2009). Reciprocal rank fusion outperforms Condorcet and individual rank learning methods. Proceedings of the 32nd International ACM SIGIR Conference on Research and Development in Information Retrieval, 758-759. DOI: 10.1145/1571941.1572114
[^2]: Lin, J., Ma, X., Lin, S.-C., Yang, J.-H., Pradeep, R., & Nogueira, R. (2021). Pyserini: A Python toolkit for reproducible information retrieval research with sparse and dense representations. Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2356-2362. DOI: 10.1145/3404835.3463238
[^3]: Nogueira, R., & Cho, K. (2019). Passage re-ranking with BERT. arXiv preprint arXiv:1901.04085. https://arxiv.org/abs/1901.04085
[^4]: Pradeep, R., Sharifymoghaddam, S., & Lin, J. (2023). RankVicuna: Zero-shot listwise document reranking with open-source large language models. arXiv preprint arXiv:2309.15088. https://arxiv.org/abs/2309.15088
Cite this post
Cole Hoffer. (Jan 2026). Reciprocal Rank Fusion for Hybrid Search. Cole Hoffer. https://www.colehoffer.ai/articles/reciprocal-rank-fusion-for-hybrid-search