Reciprocal Rank Fusion for Hybrid Search
A comprehensive guide to using Reciprocal Rank Fusion (RRF) to combine BM25 and semantic search results for production RAG systems.
Who This Guide Is For
This guide is designed for engineers building search and retrieval systems who want to understand how to combine keyword and semantic search effectively. It assumes you're already familiar with:
- BM25 or keyword search — how lexical matching works and when it's useful
- Embeddings and semantic search — using vector similarity for retrieval
- RAG (Retrieval-Augmented Generation) basics — the general pattern of retrieving context for LLMs
New to these concepts? That's completely fine! If you'd like to start with the fundamentals, we recommend reading our BM25 Search in LLM Systems guide first, which covers keyword search from the ground up. For embeddings and RAG, there are excellent introductory resources from OpenAI, Pinecone, and LangChain that explain these concepts in depth.
This guide focuses specifically on Reciprocal Rank Fusion (RRF) — a practical technique for combining multiple retrieval methods. If you're already working with both BM25 and semantic search and wondering how to merge their results, you're in the right place.
Hybrid search combines BM25 (keyword matching) and semantic search (embedding similarity) for better results. But combining them is tricky — their scores are on different scales.
Reciprocal Rank Fusion (RRF) solves this by combining rankings instead of scores[^1]. No normalization needed, minimal tuning required. RRF has become widely adopted in production search systems due to its simplicity and effectiveness compared to score-based fusion methods.
BM25 vs. Semantic Search: When to Go Hybrid?
Before diving into RRF, it's worth understanding why you'd combine BM25 and semantic search in the first place. Each has distinct strengths:
BM25 excels at:
- Exact keyword matching (product names, error codes, technical terms)
- Queries where users know the precise terminology
- Speed and simplicity — runs on CPU, no ML models required
- Explainability — you can see exactly which terms matched
Semantic search excels at:
- Conceptual matching (finding "authentication" when user searches "login")
- Natural language queries and questions
- Handling synonyms and paraphrasing
- Discovering related content the user didn't know to search for
When hybrid makes sense: Most production RAG systems benefit from hybrid search because real queries mix both patterns. A query like "how to authenticate API requests" has strong keywords ("authenticate", "API") and semantic intent (the user wants authentication methods, not just documents containing those words).
If your queries are purely keyword-based (e.g., searching product SKUs), BM25 alone may suffice. If your queries are purely conceptual with varied phrasing, semantic search alone may work. But for most use cases — especially developer documentation, support systems, and knowledge bases — hybrid search consistently outperforms either method alone[^2].
What is Reciprocal Rank Fusion?
RRF combines multiple ranked lists by working with rank positions instead of scores. Documents that rank high in multiple lists get boosted.
How RRF Works
RRF assigns scores based on rank position: each document receives 1 / (k + rank) from every list it appears in. Rank 1 gets the highest score, rank 2 gets slightly less, and so on. Documents appearing in multiple lists get their scores summed.
The parameter k (default 60) controls how much top ranks dominate[^1]. Lower k = more weight to top positions. Higher k = more gradual decay.
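Written out (this is the standard RRF formulation from Cormack et al.[^1], where R is the set of ranked lists being fused and rank_r(d) is the position of document d in list r):

$$\mathrm{RRF}(d) = \sum_{r \in R} \frac{1}{k + \mathrm{rank}_r(d)}$$

A document contributes nothing from lists that don't contain it.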
Simple Example: Imagine we have just two documents and two ranked lists:
- BM25 ranking: [Doc A, Doc B]
- Semantic ranking: [Doc B, Doc A]
Using k = 60:
- Doc A: Ranks #1 in BM25, #2 in semantic → RRF score = 1/(60+1) + 1/(60+2) = 1/61 + 1/62 ≈ 0.0325
- Doc B: Ranks #2 in BM25, #1 in semantic → RRF score = 1/(60+2) + 1/(60+1) = 1/62 + 1/61 ≈ 0.0325
Both documents get equal combined scores because they have symmetric rankings—each is #1 in one retriever and #2 in the other. This demonstrates how RRF treats both retrieval methods equally.
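To make the arithmetic concrete, here's a minimal Python check of the two-document example (nothing assumed beyond k = 60):

```python
# Two ranked lists: BM25 puts Doc A first, semantic search puts Doc B first.
bm25_ranking = ["doc_a", "doc_b"]
semantic_ranking = ["doc_b", "doc_a"]
k = 60

scores: dict[str, float] = {}
for ranking in (bm25_ranking, semantic_ranking):
    for rank, doc_id in enumerate(ranking, start=1):  # ranks are 1-based
        scores[doc_id] = scores.get(doc_id, 0.0) + 1 / (k + rank)

print(scores)
# {'doc_a': 0.0325..., 'doc_b': 0.0325...}  (identical, as expected)
```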
Score Decay by Rank: Here's how RRF scores decay across rank positions with the default k = 60:
Lower k values (20-40) favor top ranks more aggressively, while higher values (80-100) create a more gradual decay. The default k = 60 balances these trade-offs.
| Rank | RRF Score (k=60) |
|---|---|
| 1 | 0.0164 |
| 2 | 0.0161 |
| 3 | 0.0159 |
| 5 | 0.0154 |
| 10 | 0.0143 |
| 20 | 0.0125 |
| 50 | 0.0091 |
| 100 | 0.0063 |
Notice how the score decays fastest for the top ranks, then levels off. This means documents ranked in the top 10-20 positions contribute significantly more than documents ranked lower. Smaller k values sharpen this decay; larger values flatten it.
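If you want to reproduce the table or inspect other k values, a few lines of Python are enough (the rank positions below simply mirror the table):

```python
# Print the RRF contribution 1 / (k + rank) for several k values.
ranks = [1, 2, 3, 5, 10, 20, 50, 100]
for k in (20, 60, 100):
    row = ", ".join(f"rank {r}: {1 / (k + r):.4f}" for r in ranks)
    print(f"k={k} -> {row}")

# For k=60 this matches the table above
# (0.0164 at rank 1, 0.0143 at rank 10, ~0.0063 at rank 100).
```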
Why RRF Works Well
RRF solves the main problems with combining BM25 and semantic search:
Different score scales: BM25 scores (unbounded, often landing in the 0-20 range) and semantic similarity scores (typically 0-1) aren't comparable. RRF uses ranks instead.
No normalization needed: Ranks are always comparable. Rank 1 means "best match" in both systems.
Handles missing results: If BM25 finds a document embeddings miss (or vice versa), RRF still works. Documents only contribute from lists where they appear.
Minimal tuning: Works well out of the box with k = 60. While k can be adjusted, RRF is generally robust to its value[^1]. Weighted fusion requires tuning a weight parameter, which needs validation data and iteration.
A Synthetic Example: RRF in Action
Let's see how RRF combines BM25 and semantic search results with a concrete example.
Query: "How do I authenticate API requests?"
Documents:
- Doc A: "Authentication with API keys is the simplest method..."
- Doc B: "OAuth 2.0 provides secure authentication for production..."
- Doc C: "Rate limiting protects the API from abuse..."
- Doc D: "Error handling for API requests includes 401 authentication failures..."
- Doc E: "Query optimization improves retrieval performance..."
Here's how BM25 and semantic search might rank these documents, and how RRF combines them:
| Document | BM25 Score | BM25 Rank | Semantic Score | Semantic Rank | RRF Score (k=60) | Final Rank |
|---|---|---|---|---|---|---|
| Doc A | 8.5 | 1 | 0.72 | 3 | 1/61 + 1/63 = 0.0323 | 1 |
| Doc B | 4.1 | 3 | 0.88 | 1 | 1/63 + 1/61 = 0.0323 | 1 (tie) |
| Doc C | 6.2 | 2 | 0.31 | 5 | 1/62 + 1/65 = 0.0315 | 3 |
| Doc D | 3.8 | 4 | 0.78 | 2 | 1/64 + 1/62 = 0.0318 | 2 |
| Doc E | 0.0 | — | 0.45 | 4 | 0 + 1/64 = 0.0156 | 4 |
Understanding the Rankings:
- Doc A ranks #1 in BM25 because it contains the exact phrase "authentication with API keys" — strong keyword match. It ranks #3 in semantic search because the meaning is relevant but slightly less central than OAuth content.
- Doc B ranks #1 in semantic search because "OAuth 2.0 provides secure authentication" captures the semantic intent of the query better, even though it doesn't use "API" as prominently. It ranks #3 in BM25 due to fewer keyword matches.
- Doc A and Doc B have identical RRF scores (0.0323) because they have symmetric ranks: one is (1, 3) and the other is (3, 1). This is intentional — RRF treats both retrieval methods equally. Ties like this require a secondary tie-breaker in your implementation (e.g., original score, recency, or random selection).
- Doc D outranks Doc C despite Doc C ranking higher in BM25, because Doc D's strong semantic rank (#2) gives it a higher combined score.
- Doc E only appears in the semantic results (its BM25 score is 0, so it gets no BM25 contribution). It still receives an RRF score from its semantic rank, but ranks lowest overall.
Notice how RRF naturally balances the two retrieval methods. Documents that rank well in both lists get boosted, while documents that only rank well in one list still contribute but don't dominate.
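A quick sanity check of the table's arithmetic in plain Python, with the ranks taken straight from the table:

```python
k = 60

# (bm25_rank, semantic_rank) per document; None = not retrieved by that method
ranks = {
    "Doc A": (1, 3),
    "Doc B": (3, 1),
    "Doc C": (2, 5),
    "Doc D": (4, 2),
    "Doc E": (None, 4),
}

rrf = {
    doc: sum(1 / (k + r) for r in doc_ranks if r is not None)
    for doc, doc_ranks in ranks.items()
}
for doc, score in sorted(rrf.items(), key=lambda item: item[1], reverse=True):
    print(f"{doc}: {score:.4f}")
# Doc A: 0.0323, Doc B: 0.0323, Doc D: 0.0318, Doc C: 0.0315, Doc E: 0.0156
```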
Why Weighted Score Fusion Falls Short
Weighted fusion normalizes scores then combines them with a weighted average. But this has problems:
Score distributions differ: BM25 scores are skewed (most low, few high), while semantic scores cluster around 0.6-0.8. Normalization loses this information.
Normalization breaks: Score ranges change as your corpus grows or you switch embedding models. You'd need constant re-normalization.
Hard to tune: Finding the right weight requires validation data, grid search, and re-tuning as queries change. And one weight doesn't fit all queries.
Edge cases fail: What if one retriever returns no results? Or far fewer? Normalization breaks down.
RRF avoids all this by using ranks instead of scores.
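For contrast, here's a minimal sketch of min-max weighted score fusion, the approach RRF replaces. The weight alpha and the normalization step are exactly the pieces that need tuning and that drift as score distributions change (function and parameter names here are illustrative, not from any particular library):

```python
def weighted_score_fusion(
    bm25_scores: dict[str, float],
    semantic_scores: dict[str, float],
    alpha: float = 0.5,  # blend weight: needs validation data to tune
) -> list[tuple[str, float]]:
    """Min-max normalize each score set, then blend with a weighted average."""

    def normalize(scores: dict[str, float]) -> dict[str, float]:
        if not scores:
            return {}
        lo, hi = min(scores.values()), max(scores.values())
        if hi == lo:  # degenerate case: one result, or all scores identical
            return {doc: 1.0 for doc in scores}
        return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}

    bm25_norm = normalize(bm25_scores)
    sem_norm = normalize(semantic_scores)
    docs = set(bm25_norm) | set(sem_norm)
    fused = {
        doc: alpha * bm25_norm.get(doc, 0.0) + (1 - alpha) * sem_norm.get(doc, 0.0)
        for doc in docs
    }
    return sorted(fused.items(), key=lambda x: x[1], reverse=True)
```

Every failure mode above lives in this sketch: the min-max bounds move as the corpus or embedding model changes, alpha has to be re-tuned, and a retriever that returns nothing (or a single result) makes the normalization degenerate.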
Why Rerankers Are Complex
Rerankers can achieve 10-20% higher accuracy than RRF in some benchmarks[^3][^4], but they're much more complex:
Slow: 10-50ms per document means 500ms-2.5 seconds for top-50 candidates. RRF takes <1ms.
Need fine-tuning: Pre-trained models work okay, but you need to fine-tune on your domain, task, and metrics. This requires labeled data, GPUs, and ongoing re-training.
Expensive: GPU inference, infrastructure, and maintenance costs add up. The accuracy gain often isn't worth 100× the cost.
Black boxes: Can't explain why a document ranked high. RRF is interpretable — you can see it ranked #2 in BM25 and #3 in semantic search.
Rerankers make sense when latency isn't critical, accuracy is paramount, and you have labeled data and GPU resources. For most systems, RRF is the sweet spot: fast, simple, and accurate enough.
Implementing RRF
RRF is simple: assign scores based on rank (1 / (k + rank)), sum scores for documents in multiple lists, then sort.
```python
from collections import defaultdict


def reciprocal_rank_fusion(
    ranked_lists: list[list[str]],
    k: int = 60
) -> list[tuple[str, float]]:
    """
    Combine multiple ranked lists using Reciprocal Rank Fusion.

    Args:
        ranked_lists: List of ranked document ID lists. Each inner list
            is ordered by relevance (index 0 = rank 1).
        k: RRF parameter controlling score decay. Default 60.

    Returns:
        List of (doc_id, rrf_score) tuples sorted by score descending.
    """
    rrf_scores = defaultdict(float)

    for ranked_list in ranked_lists:
        # start=1 because ranks start at 1, not 0 (first element is rank 1)
        for rank, doc_id in enumerate(ranked_list, start=1):
            rrf_scores[doc_id] += 1 / (k + rank)

    # Sort by score descending
    return sorted(rrf_scores.items(), key=lambda x: x[1], reverse=True)


# Example usage
bm25_results = ["doc_a", "doc_c", "doc_b", "doc_d"]
semantic_results = ["doc_b", "doc_d", "doc_a", "doc_e"]

fused = reciprocal_rank_fusion([bm25_results, semantic_results], k=60)
# Returns: [('doc_a', 0.0323), ('doc_b', 0.0323), ('doc_d', 0.0318), ...]
```
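Wiring this into a hybrid pipeline is a thin wrapper around whatever retrievers you already have. A sketch, assuming you expose two functions bm25_search(query, top_k) and vector_search(query, top_k) that each return document IDs in ranked order (both are placeholders, not a specific library's API):

```python
def hybrid_search(query: str, top_k: int = 10, k: int = 60) -> list[tuple[str, float]]:
    # Pull more candidates than you ultimately need from each retriever,
    # so documents found by only one method still have a chance to surface.
    bm25_ids = bm25_search(query, top_k=50)      # placeholder keyword retriever
    vector_ids = vector_search(query, top_k=50)  # placeholder embedding retriever

    fused = reciprocal_rank_fusion([bm25_ids, vector_ids], k=k)
    return fused[:top_k]
```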
Tuning RRF
The default k = 60 works well for most use cases and is widely adopted across production systems. The k parameter controls how aggressively the RRF score decays as rank position increases.
Understanding the k Parameter
Lower k (20-40): Concentrates weight on top-ranked documents. Use when your retrievers are highly accurate and you want to strongly favor documents that rank well in both systems.
Higher k (80-100): Creates more gradual decay, giving mid-ranked documents more influence. Use when your retrievers are noisy or when you want to cast a wider net in the initial fusion.
Default k = 60: Balances these trade-offs. It's aggressive enough to prioritize top matches but forgiving enough to surface good documents that rank well in only one retriever.
When to Tune k
While k = 60 is a robust default, consider tuning k in these scenarios:
High-precision systems: If your use case prioritizes precision over recall (e.g., legal or medical retrieval where false positives are costly), try lower k values (30-50) to more aggressively weight documents that rank high in both retrievers.
Diverse retrieval quality: If one retriever is significantly weaker than the other, a higher k (70-90) can help surface good documents that only appear in the stronger retriever's results.
Small candidate sets: If you're only retrieving top-5 or top-10 results from each retriever, lower k values (40-50) work better since all documents are already high-ranked.
Large candidate sets: If you're fusing top-100 results from each retriever, higher k values (70-80) prevent over-concentration on the very top ranks.
How to Tune k
The best approach is to evaluate on a held-out test set with your specific retrievers and queries:
- Start with k = 60 as your baseline
- Try a sweep of other k values (e.g., 20, 40, 80, 100) on your validation set
- Measure your target metric (e.g., NDCG@10, MRR, or precision@5)
- Select the k that performs best, but only switch from 60 if the gain is meaningful (e.g., >3-5% relative improvement)
In practice, tuning k often yields modest gains, and the default of 60 is sufficient for most systems. However, tuning can be worthwhile if you have evaluation data and your use case has specific precision/recall requirements.
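Here's a minimal sketch of that procedure, assuming the reciprocal_rank_fusion function from above plus a small validation set of queries with known relevant document IDs (the data structures are illustrative, not a specific framework's format):

```python
def mrr_at_10(ranked_ids: list[str], relevant_ids: set[str]) -> float:
    """Reciprocal rank of the first relevant document within the top 10."""
    for rank, doc_id in enumerate(ranked_ids[:10], start=1):
        if doc_id in relevant_ids:
            return 1 / rank
    return 0.0


def sweep_k(validation_set, candidate_ks=(20, 40, 60, 80, 100)) -> dict[int, float]:
    """validation_set: list of (bm25_ids, semantic_ids, relevant_ids) tuples, one per query."""
    results = {}
    for k in candidate_ks:
        per_query = []
        for bm25_ids, semantic_ids, relevant_ids in validation_set:
            fused = reciprocal_rank_fusion([bm25_ids, semantic_ids], k=k)
            fused_ids = [doc_id for doc_id, _ in fused]
            per_query.append(mrr_at_10(fused_ids, relevant_ids))
        results[k] = sum(per_query) / len(per_query)  # mean MRR@10 for this k
    return results
```

Pick the k with the best mean metric, and keep 60 unless the improvement clears the threshold above.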
Choosing Your Approach
Here's how the main retrieval strategies compare on practical dimensions:
| Aspect | BM25 Only | Semantic Only | Hybrid + RRF | Hybrid + Reranker |
|---|---|---|---|---|
| Setup | Minutes | Hours (embeddings) | Hours | Days (fine-tuning) |
| Infrastructure | CPU | GPU or API | CPU + GPU/API | GPU required |
| Latency | ~1-5ms | ~10-50ms | ~15-60ms | ~100-500ms |
| Cost | Low | Medium | Medium | High |
| Tuning needed | None | Embedding choice | Minimal (k) | Extensive |
| Accuracy | Good for keywords | Good for concepts | Better overall | Best (10-20% gain) |
| Explainability | High | Low | Medium | Low |
| Best for | Exact matches | Semantic queries | General purpose | High-stakes apps |
Recommendation: Start with Hybrid + RRF. It covers the widest range of query types with minimal setup. Only move to rerankers if you have labeled data, GPU resources, and accuracy requirements that justify the added complexity.
Summary
Reciprocal Rank Fusion is the default choice for hybrid search because it:
- Works out of the box — no normalization, minimal tuning (k = 60 is a robust default)
- Handles scale differences — BM25 and semantic scores don't need to be comparable
- Is fast — <1ms fusion time vs 500ms+ for reranking
- Is interpretable — you can see exactly which ranks contributed to the final score
- Handles edge cases — missing results, different list lengths, partial matches
Weighted score fusion requires careful normalization and tuning, while rerankers require fine-tuning, labeled data, and GPU infrastructure. For most production RAG systems, RRF hits the sweet spot: simple, fast, and accurate enough.
Start with hybrid search using RRF (k = 60). Only move to rerankers if you have labeled evaluation data, GPU resources, and accuracy requirements that justify the added complexity — and even then, consider using RRF as a first-stage filter before reranking.
References
Footnotes
[^1]: Cormack, G. V., Clarke, C. L. A., & Büttcher, S. (2009). Reciprocal rank fusion outperforms Condorcet and individual rank learning methods. Proceedings of the 32nd International ACM SIGIR Conference on Research and Development in Information Retrieval, 758-759. DOI: 10.1145/1571941.1572114
[^2]: Lin, J., Ma, X., Lin, S.-C., Yang, J.-H., Pradeep, R., & Nogueira, R. (2021). Pyserini: A Python toolkit for reproducible information retrieval research with sparse and dense representations. Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2356-2362. DOI: 10.1145/3404835.3463238
[^3]: Nogueira, R., & Cho, K. (2019). Passage re-ranking with BERT. arXiv preprint arXiv:1901.04085. https://arxiv.org/abs/1901.04085
[^4]: Pradeep, R., Sharifymoghaddam, S., & Lin, J. (2023). RankVicuna: Zero-shot listwise document reranking with open-source large language models. arXiv preprint arXiv:2309.15088. https://arxiv.org/abs/2309.15088
Cite this post
Cole Hoffer. (Jan 2026). Reciprocal Rank Fusion for Hybrid Search. Cole Hoffer. https://www.colehoffer.ai/articles/reciprocal-rank-fusion-for-hybrid-search