BM25 Search in LLM Systems
A comprehensive, interactive guide to understanding the BM25 ranking algorithm for AI engineers building RAG systems and search applications.
BM25 has been the backbone of text search and retrieval for over 30 years. It's the default ranking algorithm in Elasticsearch, OpenSearch, Lucene, and most production search systems. When you search documentation, e-commerce catalogs, or internal knowledge bases — BM25 is usually what's ranking the results.
In the age of embeddings and LLMs, you might expect BM25 to be obsolete. It's not. In fact, it's become more important.
What Kind of Search is BM25?
BM25 is a lexical (or keyword-based) search algorithm. It works by matching the exact words in your query against the exact words in documents. No machine learning, no neural networks, no understanding of meaning — just word matching with smart statistics.
This puts it in contrast with semantic search, which uses embedding models to match based on meaning. Semantic search can understand that "authentication" and "log in" are related concepts; BM25 cannot. But BM25 can find the exact document containing ECONNREFUSED when semantic search returns vaguely related "connection error" results.
| Approach | How it works | Strengths | Weaknesses |
|---|---|---|---|
| Lexical (BM25) | Matches exact words | Precise keyword matching, fast, interpretable | Misses synonyms and paraphrases |
| Semantic (Embeddings) | Matches meaning via vectors | Understands related concepts | Loses exact match precision |
Neither approach is strictly better — they complement each other. This is why most production systems use both.
Why BM25 Still Matters
Embedding models are powerful, but they have a fundamental weakness: they compress meaning into fixed-dimension vectors. That compression loses information — especially for exact matches.
When a user searches for ECONNREFUSED, k8s, or error code 429, they need the document containing that exact string. Embeddings might return semantically related content ("connection errors", "Kubernetes", "rate limiting"), but miss the specific match. BM25 won't.
This is why production RAG systems almost always use hybrid retrieval — combining BM25's lexical precision with embedding models' semantic understanding. Neither alone is sufficient.
What makes BM25 so durable?
- No training required — works out of the box on any corpus, any language
- Interpretable — you can explain exactly why a document ranked high
- Fast — scales to billions of documents with inverted indexes
- Battle-tested — 30+ years of research and optimization
The algorithm itself is simple. It builds on intuitions that humans share about relevance: rare words matter more than common ones, repetition helps but with limits, and longer documents shouldn't win just because they're longer.
The Three Ideas Behind BM25
BM25 combines three intuitions about what makes a document relevant to a query. Each addresses a different problem with naive keyword matching:
- Rare words are stronger signals — if "the" matches, who cares? If "ECONNREFUSED" matches, pay attention.
- Repetition has diminishing returns — mentioning a term 10 times isn't 10x better than once.
- Document length shouldn't dominate — a 10-page doc matching once isn't better than a 1-paragraph doc matching once.
Let's look at each in detail.
1. Rare words matter more (IDF)
If a search term appears in every document, it tells you nothing. If it appears in only one document, that's a strong signal.
BM25 weights each query term by its inverse document frequency — rare terms get high weights, common terms get low weights.
| Term | Appears in... | IDF Weight |
|---|---|---|
| "the" | 100% of docs | ~0 (ignored) |
| "api" | 60% of docs | 0.51 |
| "authentication" | 20% of docs | 1.61 |
| "oauth" | 1 doc only | 2.30 |
IDF Weight vs Document Frequency
The curve shows the inverse relationship clearly: terms appearing in just 1-2 documents get high weights (strong signal), while terms appearing in most documents get low weights (weak signal).
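To make the numbers in the table concrete, here is a minimal sketch of the IDF computation. It assumes a toy 10-document corpus and uses the simple ln(N/n) form, which reproduces the weights above; most production BM25 implementations use a smoothed variant, ln((N − n + 0.5)/(n + 0.5) + 1), that behaves similarly for rare terms.

```python
import math

def idf(num_docs_containing_term, total_docs):
    """Simple inverse document frequency: rare terms get large weights."""
    return math.log(total_docs / num_docs_containing_term)

# Toy corpus of 10 documents (document frequencies are illustrative)
total_docs = 10
for term, df in [("the", 10), ("api", 6), ("authentication", 2), ("oauth", 1)]:
    print(f"{term:16s} df={df:2d}  idf={idf(df, total_docs):.2f}")
# the              df=10  idf=0.00
# api              df= 6  idf=0.51
# authentication   df= 2  idf=1.61
# oauth            df= 1  idf=2.30
```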
2. More occurrences help, but with limits (TF Saturation)
A document mentioning "authentication" twice is probably more relevant than one mentioning it once. But ten times? That's not 10x more relevant — it's probably just a longer document.
BM25 uses a saturation curve so the first few occurrences of a term boost the score significantly, but additional occurrences contribute less and less.
Term Frequency Saturation (varying k₁)
The k₁ parameter controls how fast the curve saturates. Higher values (1.5–2.0) allow more credit for repeated terms — useful for long documents where repetition matters. Lower values (0.5–1.0) saturate faster, treating presence as more important than frequency — useful when you care more about "does it contain this term?" than "how many times?" The default (k₁ = 1.2) balances these concerns.
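Here is a minimal sketch of just the saturation term, tf · (k₁ + 1) / (tf + k₁), ignoring IDF and length normalization for the moment. Extra occurrences keep adding less and less:

```python
def tf_saturation(tf, k1=1.2):
    """BM25 term-frequency component: tf * (k1 + 1) / (tf + k1)."""
    return tf * (k1 + 1) / (tf + k1)

for tf in [1, 2, 3, 5, 10]:
    print(tf, round(tf_saturation(tf), 2))
# 1 -> 1.0
# 2 -> 1.38
# 3 -> 1.57
# 5 -> 1.77
# 10 -> 1.96   (approaches k1 + 1 = 2.2, never 10x the single-occurrence score)
```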
3. Long documents shouldn't win by default (Length Normalization)
Longer documents contain more words, so they naturally match more query terms. Without adjustment, a 10-page document would almost always beat a focused 1-paragraph answer.
BM25 normalizes by document length — documents longer than average get a slight penalty, shorter documents get a slight boost.
How the penalty works: BM25 computes a normalization factor, 1 − b + b · (document length / average length), that appears in the denominator of the scoring formula. This means:
- Larger factor (above 1.0) = penalty for longer docs → dividing by a larger number reduces the score
- Smaller factor (below 1.0) = boost for shorter docs → dividing by a smaller number increases the score
- Factor of 1.0 = no adjustment (document is average length)
Length Normalization Factor (varying b)
The b parameter controls normalization strength. At b = 0, length has no effect — useful for corpora where documents are similar lengths (tweets, titles). At b = 1, the penalty/bonus is fully proportional to the length ratio. The default (b = 0.75) provides moderate normalization. Lower it (0.3–0.5) for short, uniform documents; raise it (0.8–0.9) when document lengths vary widely.
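A minimal sketch of the normalization factor itself, using an assumed average length of 100 words:

```python
def length_norm(doc_len, avg_len, b=0.75):
    """Returns >1 for longer-than-average docs (penalty), <1 for shorter (boost)."""
    return 1 - b + b * (doc_len / avg_len)

avg_len = 100
for doc_len in [50, 100, 200, 400]:
    print(doc_len, round(length_norm(doc_len, avg_len), 2))
# 50  -> 0.62  (shorter than average: boost)
# 100 -> 1.0   (average length: no adjustment)
# 200 -> 1.75  (longer: penalty)
# 400 -> 3.25
```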
The Formula (Reference)
For completeness, here's the full BM25 formula. You don't need to memorize this — the interactive demo below shows you what each piece does.
score(D, Q) = Σᵢ IDF(qᵢ) · [ f(qᵢ, D) · (k₁ + 1) ] / [ f(qᵢ, D) + k₁ · (1 − b + b · |D| / avgdl) ]
| Symbol | Meaning |
|---|---|
| f(qᵢ, D) | How many times query term qᵢ appears in document D |
| IDF(qᵢ) | Inverse document frequency — rarity weight for term qᵢ |
| \|D\| | Length of document D (word count) |
| avgdl | Average document length across the corpus |
| k₁ | Saturation parameter (default: 1.2) |
| b | Length normalization strength (default: 0.75) |
The defaults (k₁ = 1.2, b = 0.75) work well for most use cases.
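To see the pieces working together, here is a compact, runnable sketch of the formula above. It is a readability-first reference, not an optimized index: term frequencies are counted on the fly, and it uses the simple ln(N/n) IDF rather than the smoothed variant most libraries use.

```python
import math
from collections import Counter

def bm25_score(query_terms, doc_terms, corpus, k1=1.2, b=0.75):
    """Score one tokenized document against a tokenized query."""
    n_docs = len(corpus)
    avg_len = sum(len(d) for d in corpus) / n_docs
    tf = Counter(doc_terms)
    norm = 1 - b + b * (len(doc_terms) / avg_len)

    score = 0.0
    for term in query_terms:
        df = sum(1 for d in corpus if term in d)
        if df == 0 or tf[term] == 0:
            continue  # term absent from corpus or from this document
        idf = math.log(n_docs / df)
        score += idf * (tf[term] * (k1 + 1)) / (tf[term] + k1 * norm)
    return score

corpus = [doc.lower().split() for doc in [
    "OAuth authentication guide for the API",
    "How to paginate API responses",
    "Troubleshooting connection errors",
]]
query = "api authentication".split()
for doc in corpus:
    print(round(bm25_score(query, doc, corpus), 3), doc)
```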
Try It: BM25 in Action
Now see it work. Enter a query and watch how BM25 ranks the documents. Expand any result to see exactly how the score was calculated — which terms contributed, how IDF and TF combined, and how length normalization affected the final score.
Score Contribution by Query Term: "how to authenticate API requests"
Try: "oauth" (rare term scores high) · "API" vs "API authentication" (multi-term works better) · expand a result to see term-by-term breakdown · scroll down to see how each term contributes to the score
The chart at the bottom of the demo shows how different query terms contribute to scores. Notice how rare terms contribute much more per occurrence than common terms — this is why BM25 excels at finding documents with specific technical terms.
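If you want to experiment outside the interactive demo, the rank_bm25 package provides an off-the-shelf BM25 implementation. A minimal sketch (the toy corpus is illustrative):

```python
# pip install rank_bm25
from rank_bm25 import BM25Okapi

corpus = [
    "OAuth authentication guide for API requests",
    "How to paginate API responses",
    "Troubleshooting ECONNREFUSED connection errors",
]
tokenized_corpus = [doc.lower().split() for doc in corpus]

bm25 = BM25Okapi(tokenized_corpus)  # library defaults: k1=1.5, b=0.75

query = "how to authenticate api requests".split()
print(bm25.get_scores(query))              # one BM25 score per document
print(bm25.get_top_n(query, corpus, n=2))  # top-2 documents
# Note: "authenticate" does not match "authentication" unless you add stemming,
# which is exactly the lexical limitation discussed below.
```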
BM25 in RAG Systems
In modern retrieval systems, BM25 plays one of three roles:
| Role | When to use | Example |
|---|---|---|
| Primary retriever | Keyword matching is sufficient | Internal docs, error code lookup |
| Hybrid with embeddings | Need both precision and recall | Production RAG systems |
| Candidate generation | Fast pre-filter before reranking | High-volume search |
BM25 Alone Isn't Enough
BM25 is lexical — it matches exact words, not meaning. For LLM applications, this creates real problems:
| Limitation | Example | Impact |
|---|---|---|
| Vocabulary mismatch | User searches "LLM" but docs say "large language model" | Zero results for valid queries |
| No semantic understanding | "How do I log in?" won't match "Authentication Guide" | Misses relevant content |
| Short queries struggle | Single-word queries give little signal | Poor ranking with vague queries |
Pure BM25 retrieval typically achieves 20-40% lower recall than hybrid approaches. For production RAG, you need both.
Hybrid Search: BM25 + Embeddings
The standard pattern for production RAG is hybrid retrieval — run BM25 and embedding search in parallel, then combine scores.
BM25 and embeddings fail in complementary ways:
| Query type | BM25 | Embeddings | Winner |
|---|---|---|---|
| Exact error code: ECONNREFUSED | Finds exact match | Returns "connection errors" | BM25 |
| Natural language: "how do I log in" | No match for "Authentication" | Understands intent | Embeddings |
| Technical + semantic: "k8s pod restart" | Matches "pod", "restart" | Matches "Kubernetes" context | Both |
What BM25 contributes to hybrid:
- Exact keyword precision — catches the queries embeddings miss
- Speed — BM25 with inverted indexes is fast; use it to pre-filter before expensive embedding comparisons
- Interpretability — you can explain why a result matched
Combining BM25 and Embedding Scores
Running both retrievers is straightforward — the challenge is score fusion. BM25 scores and embedding similarity scores live on different scales, so you can't just add them. Here are the three main approaches:
1. Reciprocal Rank Fusion (RRF)
The most robust method. Instead of combining raw scores, RRF combines rankings.
How it works:
- Each retriever returns a ranked list
- RRF assigns each document a score of 1 / (k + rank) for each list it appears in, where k is a constant (typically 60)
- Sum the RRF scores from both retrievers
```python
def reciprocal_rank_fusion(bm25_results, embedding_results, k=60):
    """
    Combine results from BM25 and embedding search using RRF.

    Args:
        bm25_results: List of (doc_id, bm25_score) tuples, ranked
        embedding_results: List of (doc_id, similarity_score) tuples, ranked
        k: Constant to prevent over-weighting top ranks (default: 60)

    Returns:
        List of (doc_id, rrf_score) tuples, sorted by RRF score
    """
    rrf_scores = {}

    # Score from BM25 rankings
    for rank, (doc_id, _) in enumerate(bm25_results):
        rrf_scores[doc_id] = rrf_scores.get(doc_id, 0) + 1 / (k + rank + 1)

    # Score from embedding rankings
    for rank, (doc_id, _) in enumerate(embedding_results):
        rrf_scores[doc_id] = rrf_scores.get(doc_id, 0) + 1 / (k + rank + 1)

    # Sort by combined RRF score
    return sorted(rrf_scores.items(), key=lambda x: x[1], reverse=True)
```
When to use: RRF is the safest default. It's scale-independent, requires no tuning, and handles cases where one retriever returns nothing. Most production systems start here.
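For example, with hypothetical ranked lists from each retriever:

```python
bm25_results = [("doc_3", 12.4), ("doc_1", 9.8), ("doc_7", 4.2)]
embedding_results = [("doc_1", 0.91), ("doc_5", 0.88), ("doc_3", 0.79)]

fused = reciprocal_rank_fusion(bm25_results, embedding_results)
# doc_1 and doc_3 appear in both lists, so they accumulate two RRF
# contributions and rise to the top; doc_5 and doc_7 each get only one.
print(fused)
```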
2. Weighted Score Combination
Normalize both scores to [0, 1], then combine with weights.
```python
def weighted_score_fusion(bm25_results, embedding_results, alpha=0.5):
    """
    Combine normalized scores with weighted average.

    Args:
        bm25_results: List of (doc_id, bm25_score) tuples
        embedding_results: List of (doc_id, similarity_score) tuples
        alpha: Weight for BM25 (0=embeddings only, 1=BM25 only)

    Returns:
        List of (doc_id, combined_score) tuples, sorted
    """
    def min_max_normalize(score_dict):
        # Map scores to [0, 1]; guard against division by zero
        if not score_dict:
            return {}
        max_s, min_s = max(score_dict.values()), min(score_dict.values())
        score_range = max_s - min_s if max_s != min_s else 1
        return {
            doc_id: (score - min_s) / score_range
            for doc_id, score in score_dict.items()
        }

    # Normalize BM25 scores (min-max normalization)
    bm25_dict = min_max_normalize(dict(bm25_results))

    # Normalize embedding scores the same way, so both live on [0, 1]
    emb_dict = min_max_normalize(dict(embedding_results))

    # Combine scores over the union of retrieved documents
    all_docs = set(bm25_dict.keys()) | set(emb_dict.keys())
    combined = {
        doc_id: alpha * bm25_dict.get(doc_id, 0) + (1 - alpha) * emb_dict.get(doc_id, 0)
        for doc_id in all_docs
    }

    return sorted(combined.items(), key=lambda x: x[1], reverse=True)
```
When to use: When you have strong prior knowledge about your query distribution. For example:
- α = 0.7–0.8 — technical documentation with lots of error codes, API references
- α = 0.3–0.4 — conceptual content where semantic matching dominates
- α = 0.5 — balanced (start here if unsure)
Tuning α: Use a validation set of real queries. Track precision@k for different α values and pick the best performer.
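A minimal sketch of that tuning loop, using weighted_score_fusion from above. The validation-set layout here (ranked lists plus a set of relevant doc IDs per query) is an assumption, not a fixed API:

```python
def precision_at_k(ranked_doc_ids, relevant_ids, k=10):
    """Fraction of the top-k results that are relevant."""
    top_k = ranked_doc_ids[:k]
    return sum(1 for doc_id in top_k if doc_id in relevant_ids) / k

def tune_alpha(validation_queries, alphas=(0.3, 0.4, 0.5, 0.6, 0.7, 0.8)):
    """
    validation_queries: list of dicts with keys
      "bm25", "embedding" (ranked (doc_id, score) lists) and "relevant" (set of doc_ids)
    """
    best_alpha, best_score = None, -1.0
    for alpha in alphas:
        per_query = []
        for q in validation_queries:
            fused = weighted_score_fusion(q["bm25"], q["embedding"], alpha=alpha)
            ranked_ids = [doc_id for doc_id, _ in fused]
            per_query.append(precision_at_k(ranked_ids, q["relevant"], k=10))
        avg = sum(per_query) / len(per_query)
        if avg > best_score:
            best_alpha, best_score = alpha, avg
    return best_alpha, best_score
```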
3. Separate Retrieval + Cross-Encoder Reranking
Retrieve candidates from both sources, pool them, then rerank with a cross-encoder.
```python
def retrieve_and_rerank(query, bm25_retriever, embedding_retriever,
                        reranker, top_k=10, candidate_k=50):
    """
    Retrieve candidates from both sources, then rerank.

    Args:
        query: User query string
        bm25_retriever: BM25 retriever instance
        embedding_retriever: Embedding retriever instance
        reranker: Cross-encoder model for reranking
        top_k: Final number of results to return
        candidate_k: Number of candidates to retrieve from each source

    Returns:
        Top-k documents after reranking
    """
    # Get candidates from both retrievers
    bm25_candidates = bm25_retriever.retrieve(query, top_k=candidate_k)
    emb_candidates = embedding_retriever.retrieve(query, top_k=candidate_k)

    # Pool and deduplicate
    all_candidates = {doc.id: doc for doc in bm25_candidates + emb_candidates}

    # Rerank with cross-encoder
    rerank_scores = reranker.predict(
        [(query, doc.text) for doc in all_candidates.values()]
    )

    # Sort by reranker scores
    ranked = sorted(
        zip(all_candidates.values(), rerank_scores),
        key=lambda x: x[1],
        reverse=True
    )

    return [doc for doc, score in ranked[:top_k]]
```
When to use: When you need maximum accuracy and can afford the latency (cross-encoders are slow). Common pattern: use BM25+embeddings for fast candidate generation (top 50-100), then rerank down to top 10.
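The reranker above can be any model that scores (query, document) pairs. One common choice is a cross-encoder from the sentence-transformers library; a minimal sketch (the checkpoint name is a published MS MARCO model, and the passage strings are placeholders):

```python
from sentence_transformers import CrossEncoder

# Small, widely used MS MARCO cross-encoder checkpoint
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

# Scores a batch of (query, passage) pairs; higher means more relevant
scores = reranker.predict([
    ("how do I authenticate api requests", "OAuth authentication guide ..."),
    ("how do I authenticate api requests", "Troubleshooting connection errors ..."),
])
print(scores)
```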
Tuning BM25 Parameters in Hybrid Systems
The default parameters (k₁ = 1.2, b = 0.75) work well for solo BM25, but hybrid systems may benefit from adjustments:
| Scenario | Recommended k₁ | Recommended b | Reasoning |
|---|---|---|---|
| Short documents (tweets, titles) | 0.5–1.0 | 0.0–0.3 | Presence matters more than frequency; minimal length variance |
| Mixed lengths (articles, docs) | 1.2–1.5 | 0.7–0.8 | Default works well; slight increase if embeddings handle semantics |
| Technical content (code, errors) | 1.5–2.0 | 0.5–0.7 | Repetition signals importance; moderate normalization |
| Long documents (papers, manuals) | 1.0–1.2 | 0.8–0.9 | Avoid over-crediting repetition; strong length penalty |
In hybrid setups specifically:
- Lower k₁ (0.8–1.0) if embeddings already capture semantic importance — BM25 becomes a "presence checker"
- Lower b (0.3–0.5) if the embedding retriever already handles length bias
- Keep the defaults if using RRF — because it fuses ranks rather than raw scores, it is less sensitive to BM25's score scale
Empirical tuning:
- Create a validation set of 50–100 queries with known relevant documents
- Grid search over k₁ and b (a sketch follows this list)
- Measure recall@10 or MRR (mean reciprocal rank) for the hybrid system
- Pick the combination that maximizes your metric
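A minimal sketch of that grid search using rank_bm25, assuming a validation set of (tokenized_query, relevant_doc_ids) pairs and a doc_ids list parallel to the corpus. In a hybrid system you would score the fused results rather than BM25 alone:

```python
import numpy as np
from rank_bm25 import BM25Okapi

def recall_at_10(bm25, tokenized_query, relevant_ids, doc_ids):
    """Fraction of the relevant documents that appear in the BM25 top 10."""
    scores = bm25.get_scores(tokenized_query)
    top_ids = [doc_ids[i] for i in np.argsort(scores)[::-1][:10]]
    return len(set(top_ids) & set(relevant_ids)) / max(len(relevant_ids), 1)

def grid_search(tokenized_corpus, doc_ids, validation_set):
    best = (None, None, -1.0)
    for k1 in (0.8, 1.0, 1.2, 1.5, 2.0):
        for b in (0.3, 0.5, 0.75, 0.9):
            bm25 = BM25Okapi(tokenized_corpus, k1=k1, b=b)
            recalls = [recall_at_10(bm25, q, rel, doc_ids) for q, rel in validation_set]
            avg = sum(recalls) / len(recalls)
            if avg > best[2]:
                best = (k1, b, avg)
    return best  # (best_k1, best_b, best_recall)
```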
Tools supporting hybrid search:
- Elasticsearch/OpenSearch — BM25 by default, add kNN for vectors
- Pinecone — sparse (BM25-style) + dense vectors in one query
- Weaviate — hybrid search with configurable alpha between BM25 and vectors
Most RAG frameworks (LangChain, LlamaIndex) provide ensemble retrievers that handle score fusion automatically.
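As one example, LangChain pairs a BM25 retriever with a vector retriever inside an EnsembleRetriever that performs weighted rank fusion. A sketch only: import paths and class names shift between LangChain versions, and the corpus here is illustrative.

```python
# pip install langchain langchain-community langchain-openai faiss-cpu rank_bm25
from langchain_community.retrievers import BM25Retriever
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings
from langchain.retrievers import EnsembleRetriever

texts = [
    "OAuth authentication guide for API requests",
    "Troubleshooting ECONNREFUSED connection errors",
]

bm25_retriever = BM25Retriever.from_texts(texts)
vector_retriever = FAISS.from_texts(texts, OpenAIEmbeddings()).as_retriever()

# Weights control the relative contribution of each retriever's rankings
hybrid = EnsembleRetriever(
    retrievers=[bm25_retriever, vector_retriever],
    weights=[0.5, 0.5],
)
docs = hybrid.invoke("how do I authenticate api requests")
```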
Summary
BM25 ranks documents by combining three factors: IDF (rare terms matter more), TF saturation (diminishing returns for repetition), and length normalization (fair comparison across document sizes).
It's fast, interpretable, and requires no training — but it only matches exact words. For production RAG, use hybrid search: BM25 for keyword precision, embeddings for semantic understanding. The combination catches queries that either approach alone would miss.
Cite this post
Cole Hoffer. (Jan 2026). BM25 Search in LLM Systems. Cole Hoffer. https://www.colehoffer.ai/articles/bm25-search-in-llm-systems