BM25 Search in LLM Systems
A comprehensive, interactive guide to understanding the BM25 ranking algorithm for AI engineers building RAG systems and search applications.
BM25 has been the backbone of text search and retrieval for over 30 years. It's the default ranking algorithm in Elasticsearch, OpenSearch, Lucene, and most production search systems. When you search documentation, e-commerce catalogs, or internal knowledge bases — BM25 is usually what's ranking the results.
In the age of embeddings and LLMs, you might expect BM25 to be obsolete. It's not. In fact, it's become more important.
What Kind of Search is BM25?
BM25 is a lexical (or keyword-based) search algorithm. It works by matching the exact words in your query against the exact words in documents. No machine learning, no neural networks, no understanding of meaning — just word matching with smart statistics.
This puts it in contrast with semantic search, which uses embedding models to match based on meaning. Semantic search can understand that "authentication" and "log in" are related concepts; BM25 cannot. But BM25 can find the exact document containing ECONNREFUSED when semantic search returns vaguely related "connection error" results.
| Approach | How it works | Strengths | Weaknesses |
|---|---|---|---|
| Lexical (BM25) | Matches exact words | Precise keyword matching, fast, interpretable | Misses synonyms and paraphrases |
| Semantic (Embeddings) | Matches meaning via vectors | Understands related concepts | Loses exact match precision |
Neither approach is strictly better — they complement each other. This is why most production systems use both.
Why BM25 Still Matters
Embedding models are powerful, but they have a fundamental weakness: they compress meaning into fixed-dimension vectors. That compression loses information — especially for exact matches.
When a user searches for ECONNREFUSED, k8s, or error code 429, they need the document containing that exact string. Embeddings might return semantically related content ("connection errors", "Kubernetes", "rate limiting"), but miss the specific match. BM25 won't.
This is why production RAG systems almost always use hybrid retrieval — combining BM25's lexical precision with embedding models' semantic understanding. Neither alone is sufficient.
What makes BM25 so durable?
- No training required — works out of the box on any corpus, any language
- Interpretable — you can explain exactly why a document ranked high
- Fast — scales to billions of documents with inverted indexes
- Battle-tested — 30+ years of research and optimization
The algorithm itself is simple. It builds on intuitions that humans share about relevance: rare words matter more than common ones, repetition helps but with limits, and longer documents shouldn't win just because they're longer.
The Three Ideas Behind BM25
BM25 combines three intuitions about what makes a document relevant to a query. Each addresses a different problem with naive keyword matching:
- Rare words are stronger signals — if "the" matches, who cares? If "ECONNREFUSED" matches, pay attention.
- Repetition has diminishing returns — mentioning a term 10 times isn't 10x better than once.
- Document length shouldn't dominate — a 10-page doc matching once isn't better than a 1-paragraph doc matching once.
Let's look at each in detail.
1. Rare words matter more (IDF)
If a search term appears in every document, it tells you nothing. If it appears in only one document, that's a strong signal.
BM25 weights each query term by its inverse document frequency — rare terms get high weights, common terms get low weights.
| Term | Appears in... | IDF Weight |
|---|---|---|
| "the" | 100% of docs | ~0 (ignored) |
| "api" | 60% of docs | 0.51 |
| "authentication" | 20% of docs | 1.61 |
| "oauth" | 1 doc only | 2.30 |
IDF Weight vs Document Frequency
The curve shows the inverse relationship clearly: terms appearing in just 1-2 documents get high weights (strong signal), while terms appearing in most documents get low weights (weak signal).
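To make the numbers in the table concrete, here is a minimal sketch of the IDF computation. It assumes a toy 10-document corpus and uses the simple ln(N/n) form, which reproduces the weights above; most production BM25 implementations use a smoothed variant, ln((N − n + 0.5)/(n + 0.5) + 1), that behaves similarly for rare terms.

```python
import math

def idf(num_docs_containing_term, total_docs):
    """Simple inverse document frequency: rare terms get large weights."""
    return math.log(total_docs / num_docs_containing_term)

# Toy corpus of 10 documents (document frequencies are illustrative)
total_docs = 10
for term, df in [("the", 10), ("api", 6), ("authentication", 2), ("oauth", 1)]:
    print(f"{term:16s} df={df:2d}  idf={idf(df, total_docs):.2f}")
# the              df=10  idf=0.00
# api              df= 6  idf=0.51
# authentication   df= 2  idf=1.61
# oauth            df= 1  idf=2.30
```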
2. More occurrences help, but with limits (TF Saturation)
A document mentioning "authentication" twice is probably more relevant than one mentioning it once. But ten times? That's not 10x more relevant — it's probably just a longer document.
BM25 uses a saturation curve so the first few occurrences of a term boost the score significantly, but additional occurrences contribute less and less.
Term Frequency Saturation (varying k₁)
The k₁ parameter controls how fast the curve saturates. Higher values (1.5–2.0) allow more credit for repeated terms — useful for long documents where repetition matters. Lower values (0.5–1.0) saturate faster, treating presence as more important than frequency — useful when you care more about "does it contain this term?" than "how many times?" The default (k₁ = 1.2) balances these concerns.
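Here is a minimal sketch of just the saturation term, tf · (k₁ + 1) / (tf + k₁), ignoring IDF and length normalization for the moment. Extra occurrences keep adding less and less:

```python
def tf_saturation(tf, k1=1.2):
    """BM25 term-frequency component: tf * (k1 + 1) / (tf + k1)."""
    return tf * (k1 + 1) / (tf + k1)

for tf in [1, 2, 3, 5, 10]:
    print(tf, round(tf_saturation(tf), 2))
# 1 -> 1.0
# 2 -> 1.38
# 3 -> 1.57
# 5 -> 1.77
# 10 -> 1.96   (approaches k1 + 1 = 2.2, never 10x the single-occurrence score)
```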
3. Long documents shouldn't win by default (Length Normalization)
Longer documents contain more words, so they naturally match more query terms. Without adjustment, a 10-page document would almost always beat a focused 1-paragraph answer.
BM25 normalizes by document length — documents longer than average get a slight penalty, shorter documents get a slight boost.
How the penalty works: BM25 computes a normalization factor, 1 − b + b · (document length / average length), that appears in the denominator of the scoring formula. This means:
- Larger factor (above 1.0) = penalty for longer docs → dividing by a larger number reduces the score
- Smaller factor (below 1.0) = boost for shorter docs → dividing by a smaller number increases the score
- Factor of 1.0 = no adjustment (document is average length)
Length Normalization Factor (varying b)
The b parameter controls normalization strength. At b = 0, length has no effect — useful for corpora where documents are similar lengths (tweets, titles). At b = 1, the penalty/bonus is fully proportional to the length ratio. The default (b = 0.75) provides moderate normalization. Lower it (0.3–0.5) for short, uniform documents; raise it (0.8–0.9) when document lengths vary widely.
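A minimal sketch of the normalization factor itself, using an assumed average length of 100 words:

```python
def length_norm(doc_len, avg_len, b=0.75):
    """Returns >1 for longer-than-average docs (penalty), <1 for shorter (boost)."""
    return 1 - b + b * (doc_len / avg_len)

avg_len = 100
for doc_len in [50, 100, 200, 400]:
    print(doc_len, round(length_norm(doc_len, avg_len), 2))
# 50  -> 0.62  (shorter than average: boost)
# 100 -> 1.0   (average length: no adjustment)
# 200 -> 1.75  (longer: penalty)
# 400 -> 3.25
```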
The Formula (Reference)
For completeness, here's the full BM25 formula. You don't need to memorize this — the interactive demo below shows you what each piece does.
score(D, Q) = Σᵢ IDF(qᵢ) · [ f(qᵢ, D) · (k₁ + 1) ] / [ f(qᵢ, D) + k₁ · (1 − b + b · |D| / avgdl) ]
| Symbol | Meaning |
|---|---|
| f(qᵢ, D) | How many times query term qᵢ appears in document D |
| IDF(qᵢ) | Inverse document frequency — rarity weight for term qᵢ |
| \|D\| | Length of document D (word count) |
| avgdl | Average document length across the corpus |
| k₁ | Saturation parameter (default: 1.2) |
| b | Length normalization strength (default: 0.75) |
The defaults (k₁ = 1.2, b = 0.75) work well for most use cases.
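To see the pieces working together, here is a compact, runnable sketch of the formula above. It is a readability-first reference, not an optimized index: term frequencies are counted on the fly, and it uses the simple ln(N/n) IDF rather than the smoothed variant most libraries use.

```python
import math
from collections import Counter

def bm25_score(query_terms, doc_terms, corpus, k1=1.2, b=0.75):
    """Score one tokenized document against a tokenized query."""
    n_docs = len(corpus)
    avg_len = sum(len(d) for d in corpus) / n_docs
    tf = Counter(doc_terms)
    norm = 1 - b + b * (len(doc_terms) / avg_len)

    score = 0.0
    for term in query_terms:
        df = sum(1 for d in corpus if term in d)
        if df == 0 or tf[term] == 0:
            continue  # term absent from corpus or from this document
        idf = math.log(n_docs / df)
        score += idf * (tf[term] * (k1 + 1)) / (tf[term] + k1 * norm)
    return score

corpus = [doc.lower().split() for doc in [
    "OAuth authentication guide for the API",
    "How to paginate API responses",
    "Troubleshooting connection errors",
]]
query = "api authentication".split()
for doc in corpus:
    print(round(bm25_score(query, doc, corpus), 3), doc)
```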
Try It: BM25 in Action
Now see it work. Enter a query and watch how BM25 ranks the documents. Expand any result to see exactly how the score was calculated — which terms contributed, how IDF and TF combined, and how length normalization affected the final score.
Score Contribution by Query Term: "how to authenticate API requests"
Try: "oauth" (rare term scores high) · "API" vs "API authentication" (multi-term works better) · expand a result to see term-by-term breakdown · scroll down to see how each term contributes to the score
The chart at the bottom of the demo shows how different query terms contribute to scores. Notice how rare terms contribute much more per occurrence than common terms — this is why BM25 excels at finding documents with specific technical terms.
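If you want to experiment outside the interactive demo, the rank_bm25 package provides an off-the-shelf BM25 implementation. A minimal sketch (the toy corpus is illustrative):

```python
# pip install rank_bm25
from rank_bm25 import BM25Okapi

corpus = [
    "OAuth authentication guide for API requests",
    "How to paginate API responses",
    "Troubleshooting ECONNREFUSED connection errors",
]
tokenized_corpus = [doc.lower().split() for doc in corpus]

bm25 = BM25Okapi(tokenized_corpus)  # library defaults: k1=1.5, b=0.75

query = "how to authenticate api requests".split()
print(bm25.get_scores(query))              # one BM25 score per document
print(bm25.get_top_n(query, corpus, n=2))  # top-2 documents
# Note: "authenticate" does not match "authentication" unless you add stemming,
# which is exactly the lexical limitation discussed below.
```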
BM25 in RAG Systems
In modern retrieval systems, BM25 plays one of three roles:
| Role | When to use | Example |
|---|---|---|
| Primary retriever | Keyword matching is sufficient | Internal docs, error code lookup |
| Hybrid with embeddings | Need both precision and recall | Production RAG systems |
| Candidate generation | Fast pre-filter before reranking | High-volume search |
BM25 Alone Isn't Enough
BM25 is lexical — it matches exact words, not meaning. For LLM applications, this creates real problems:
| Limitation | Example | Impact |
|---|---|---|
| Vocabulary mismatch | User searches "LLM" but docs say "large language model" | Zero results for valid queries |
| No semantic understanding | "How do I log in?" won't match "Authentication Guide" | Misses relevant content |
| Short queries struggle | Single-word queries give little signal | Poor ranking with vague queries |
Pure BM25 retrieval typically achieves 20-40% lower recall than hybrid approaches. For production RAG, you need both.
Hybrid Search: BM25 + Embeddings
The standard pattern for production RAG is hybrid retrieval — run BM25 and embedding search in parallel, then combine scores.
BM25 and embeddings fail in complementary ways:
| Query type | BM25 | Embeddings | Winner |
|---|---|---|---|
| Exact error code: ECONNREFUSED | Finds exact match | Returns "connection errors" | BM25 |
| Natural language: "how do I log in" | No match for "Authentication" | Understands intent | Embeddings |
| Technical + semantic: "k8s pod restart" | Matches "pod", "restart" | Matches "Kubernetes" context | Both |
What BM25 contributes to hybrid:
- Exact keyword precision — catches the queries embeddings miss
- Speed — BM25 with inverted indexes is fast; use it to pre-filter before expensive embedding comparisons
- Interpretability — you can explain why a result matched
Combining BM25 and Embedding Scores
Running both retrievers is straightforward — the challenge is score fusion. BM25 scores and embedding similarity scores live on different scales, so you can't just add them. Here are the three main approaches:
1. Reciprocal Rank Fusion (RRF)
The most robust method. Instead of combining raw scores, RRF combines rankings.
How it works:
- Each retriever returns a ranked list
- RRF assigns each document a score of 1 / (k + rank) for each list it appears in, where k is a constant (typically 60)
- Sum the RRF scores from both retrievers
```python
def reciprocal_rank_fusion(bm25_results, embedding_results, k=60):
    """
    Combine results from BM25 and embedding search using RRF.

    Args:
        bm25_results: List of (doc_id, bm25_score) tuples, ranked
        embedding_results: List of (doc_id, similarity_score) tuples, ranked
        k: Constant to prevent over-weighting top ranks (default: 60)

    Returns:
        List of (doc_id, rrf_score) tuples, sorted by RRF score
    """
    rrf_scores = {}

    # Score from BM25 rankings
    for rank, (doc_id, _) in enumerate(bm25_results):
        rrf_scores[doc_id] = rrf_scores.get(doc_id, 0) + 1 / (k + rank + 1)

    # Score from embedding rankings
    for rank, (doc_id, _) in enumerate(embedding_results):
        rrf_scores[doc_id] = rrf_scores.get(doc_id, 0) + 1 / (k + rank + 1)

    # Sort by combined RRF score
    return sorted(rrf_scores.items(), key=lambda x: x[1], reverse=True)
```
When to use: RRF is the safest default. It's scale-independent, requires no tuning, and handles cases where one retriever returns nothing. Most production systems start here.
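For example, with hypothetical ranked lists from each retriever:

```python
bm25_results = [("doc_3", 12.4), ("doc_1", 9.8), ("doc_7", 4.2)]
embedding_results = [("doc_1", 0.91), ("doc_5", 0.88), ("doc_3", 0.79)]

fused = reciprocal_rank_fusion(bm25_results, embedding_results)
# doc_1 and doc_3 appear in both lists, so they accumulate two RRF
# contributions and rise to the top; doc_5 and doc_7 each get only one.
print(fused)
```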
2. Weighted Score Combination
Normalize both scores to [0, 1], then combine with weights.
```python
def weighted_score_fusion(bm25_results, embedding_results, alpha=0.5):
    """
    Combine normalized scores with weighted average.

    Args:
        bm25_results: List of (doc_id, bm25_score) tuples
        embedding_results: List of (doc_id, similarity_score) tuples
        alpha: Weight for BM25 (0=embeddings only, 1=BM25 only)

    Returns:
        List of (doc_id, combined_score) tuples, sorted
    """
    def min_max_normalize(score_dict):
        # Map scores to [0, 1]; guard against division by zero
        if not score_dict:
            return {}
        max_s, min_s = max(score_dict.values()), min(score_dict.values())
        score_range = max_s - min_s if max_s != min_s else 1
        return {
            doc_id: (score - min_s) / score_range
            for doc_id, score in score_dict.items()
        }

    # Normalize BM25 scores (min-max normalization)
    bm25_dict = min_max_normalize(dict(bm25_results))

    # Normalize embedding scores the same way, so both live on [0, 1]
    emb_dict = min_max_normalize(dict(embedding_results))

    # Combine scores over the union of retrieved documents
    all_docs = set(bm25_dict.keys()) | set(emb_dict.keys())
    combined = {
        doc_id: alpha * bm25_dict.get(doc_id, 0) + (1 - alpha) * emb_dict.get(doc_id, 0)
        for doc_id in all_docs
    }

    return sorted(combined.items(), key=lambda x: x[1], reverse=True)
```
When to use: When you have strong prior knowledge about your query distribution. For example:
- α = 0.7–0.8 — technical documentation with lots of error codes, API references
- α = 0.3–0.4 — conceptual content where semantic matching dominates
- α = 0.5 — balanced (start here if unsure)
Tuning α: Use a validation set of real queries. Track precision@k for different α values and pick the best performer.
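A minimal sketch of that tuning loop, using weighted_score_fusion from above. The validation-set layout here (ranked lists plus a set of relevant doc IDs per query) is an assumption, not a fixed API:

```python
def precision_at_k(ranked_doc_ids, relevant_ids, k=10):
    """Fraction of the top-k results that are relevant."""
    top_k = ranked_doc_ids[:k]
    return sum(1 for doc_id in top_k if doc_id in relevant_ids) / k

def tune_alpha(validation_queries, alphas=(0.3, 0.4, 0.5, 0.6, 0.7, 0.8)):
    """
    validation_queries: list of dicts with keys
      "bm25", "embedding" (ranked (doc_id, score) lists) and "relevant" (set of doc_ids)
    """
    best_alpha, best_score = None, -1.0
    for alpha in alphas:
        per_query = []
        for q in validation_queries:
            fused = weighted_score_fusion(q["bm25"], q["embedding"], alpha=alpha)
            ranked_ids = [doc_id for doc_id, _ in fused]
            per_query.append(precision_at_k(ranked_ids, q["relevant"], k=10))
        avg = sum(per_query) / len(per_query)
        if avg > best_score:
            best_alpha, best_score = alpha, avg
    return best_alpha, best_score
```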
3. Separate Retrieval + Cross-Encoder Reranking
Retrieve candidates from both sources, pool them, then rerank with a cross-encoder.
```python
def retrieve_and_rerank(query, bm25_retriever, embedding_retriever,
                        reranker, top_k=10, candidate_k=50):
    """
    Retrieve candidates from both sources, then rerank.

    Args:
        query: User query string
        bm25_retriever: BM25 retriever instance
        embedding_retriever: Embedding retriever instance
        reranker: Cross-encoder model for reranking
        top_k: Final number of results to return
        candidate_k: Number of candidates to retrieve from each source

    Returns:
        Top-k documents after reranking
    """
    # Get candidates from both retrievers
    bm25_candidates = bm25_retriever.retrieve(query, top_k=candidate_k)
    emb_candidates = embedding_retriever.retrieve(query, top_k=candidate_k)

    # Pool and deduplicate
    all_candidates = {doc.id: doc for doc in bm25_candidates + emb_candidates}

    # Rerank with cross-encoder
    rerank_scores = reranker.predict(
        [(query, doc.text) for doc in all_candidates.values()]
    )

    # Sort by reranker scores
    ranked = sorted(
        zip(all_candidates.values(), rerank_scores),
        key=lambda x: x[1],
        reverse=True
    )

    return [doc for doc, score in ranked[:top_k]]
```
When to use: When you need maximum accuracy and can afford the latency (cross-encoders are slow). Common pattern: use BM25+embeddings for fast candidate generation (top 50-100), then rerank down to top 10.
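The reranker above can be any model that scores (query, document) pairs. One common choice is a cross-encoder from the sentence-transformers library; a minimal sketch (the checkpoint name is a published MS MARCO model, and the passage strings are placeholders):

```python
from sentence_transformers import CrossEncoder

# Small, widely used MS MARCO cross-encoder checkpoint
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

# Scores a batch of (query, passage) pairs; higher means more relevant
scores = reranker.predict([
    ("how do I authenticate api requests", "OAuth authentication guide ..."),
    ("how do I authenticate api requests", "Troubleshooting connection errors ..."),
])
print(scores)
```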
Tuning BM25 Parameters in Hybrid Systems
The default parameters (k₁ = 1.2, b = 0.75) work well for solo BM25, but hybrid systems may benefit from adjustments:
| Scenario | Recommended k₁ | Recommended b | Reasoning |
|---|---|---|---|
| Short documents (tweets, titles) | 0.5–1.0 | 0.0–0.3 | Presence matters more than frequency; minimal length variance |
| Mixed lengths (articles, docs) | 1.2–1.5 | 0.7–0.8 | Default works well; slight increase if embeddings handle semantics |
| Technical content (code, errors) | 1.5–2.0 | 0.5–0.7 | Repetition signals importance; moderate normalization |
| Long documents (papers, manuals) | 1.0–1.2 | 0.8–0.9 | Avoid over-crediting repetition; strong length penalty |
In hybrid setups specifically:
- Lower k₁ (0.8–1.0) if embeddings already capture semantic importance — BM25 becomes a "presence checker"
- Lower b (0.3–0.5) if the embedding retriever already handles length bias
- Keep the defaults if using RRF — because it fuses ranks rather than raw scores, it is less sensitive to BM25's score scale
Empirical tuning:
- Create a validation set of 50–100 queries with known relevant documents
- Grid search over k₁ and b (a sketch follows this list)
- Measure recall@10 or MRR (mean reciprocal rank) for the hybrid system
- Pick the combination that maximizes your metric
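A minimal sketch of that grid search using rank_bm25, assuming a validation set of (tokenized_query, relevant_doc_ids) pairs and a doc_ids list parallel to the corpus. In a hybrid system you would score the fused results rather than BM25 alone:

```python
import numpy as np
from rank_bm25 import BM25Okapi

def recall_at_10(bm25, tokenized_query, relevant_ids, doc_ids):
    """Fraction of the relevant documents that appear in the BM25 top 10."""
    scores = bm25.get_scores(tokenized_query)
    top_ids = [doc_ids[i] for i in np.argsort(scores)[::-1][:10]]
    return len(set(top_ids) & set(relevant_ids)) / max(len(relevant_ids), 1)

def grid_search(tokenized_corpus, doc_ids, validation_set):
    best = (None, None, -1.0)
    for k1 in (0.8, 1.0, 1.2, 1.5, 2.0):
        for b in (0.3, 0.5, 0.75, 0.9):
            bm25 = BM25Okapi(tokenized_corpus, k1=k1, b=b)
            recalls = [recall_at_10(bm25, q, rel, doc_ids) for q, rel in validation_set]
            avg = sum(recalls) / len(recalls)
            if avg > best[2]:
                best = (k1, b, avg)
    return best  # (best_k1, best_b, best_recall)
```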
Tools supporting hybrid search:
- Elasticsearch/OpenSearch — BM25 by default, add kNN for vectors
- Pinecone — sparse (BM25-style) + dense vectors in one query
- Weaviate — hybrid search with configurable alpha between BM25 and vectors
Most RAG frameworks (LangChain, LlamaIndex) provide ensemble retrievers that handle score fusion automatically.
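As one example, LangChain pairs a BM25 retriever with a vector retriever inside an EnsembleRetriever that performs weighted rank fusion. A sketch only: import paths and class names shift between LangChain versions, and the corpus here is illustrative.

```python
# pip install langchain langchain-community langchain-openai faiss-cpu rank_bm25
from langchain_community.retrievers import BM25Retriever
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings
from langchain.retrievers import EnsembleRetriever

texts = [
    "OAuth authentication guide for API requests",
    "Troubleshooting ECONNREFUSED connection errors",
]

bm25_retriever = BM25Retriever.from_texts(texts)
vector_retriever = FAISS.from_texts(texts, OpenAIEmbeddings()).as_retriever()

# Weights control the relative contribution of each retriever's rankings
hybrid = EnsembleRetriever(
    retrievers=[bm25_retriever, vector_retriever],
    weights=[0.5, 0.5],
)
docs = hybrid.invoke("how do I authenticate api requests")
```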
Summary
BM25 ranks documents by combining three factors: IDF (rare terms matter more), TF saturation (diminishing returns for repetition), and length normalization (fair comparison across document sizes).
It's fast, interpretable, and requires no training — but it only matches exact words. For production RAG, use hybrid search: BM25 for keyword precision, embeddings for semantic understanding. The combination catches queries that either approach alone would miss.
Cite this post
Cole Hoffer. (Jan 2026). BM25 Search in LLM Systems. Cole Hoffer. https://www.colehoffer.ai/articles/bm25-search-in-llm-systems