BM25 Search in LLM Systems

Updated January 3, 2026

A comprehensive, interactive guide to understanding the BM25 ranking algorithm for AI engineers building RAG systems and search applications.

BM25 has been the backbone of text search and retrieval for over 30 years. It's the default ranking algorithm in Elasticsearch, OpenSearch, Lucene, and most production search systems. When you search documentation, e-commerce catalogs, or internal knowledge bases — BM25 is usually what's ranking the results.

In the age of embeddings and LLMs, you might expect BM25 to be obsolete. It's not. In fact, it's become more important.

What Kind of Search is BM25?

BM25 is a lexical (or keyword-based) search algorithm. It works by matching the exact words in your query against the exact words in documents. No machine learning, no neural networks, no understanding of meaning — just word matching with smart statistics.

This puts it in contrast with semantic search, which uses embedding models to match based on meaning. Semantic search can understand that "authentication" and "log in" are related concepts; BM25 cannot. But BM25 can find the exact document containing ECONNREFUSED when semantic search returns vaguely related "connection error" results.

Approach | How it works | Strengths | Weaknesses
Lexical (BM25) | Matches exact words | Precise keyword matching, fast, interpretable | Misses synonyms and paraphrases
Semantic (Embeddings) | Matches meaning via vectors | Understands related concepts | Loses exact match precision

Neither approach is strictly better — they complement each other. This is why most production systems use both.

Why BM25 Still Matters

Embedding models are powerful, but they have a fundamental weakness: they compress meaning into fixed-dimension vectors. That compression loses information — especially for exact matches.

When a user searches for ECONNREFUSED, k8s, or error code 429, they need the document containing that exact string. Embeddings might return semantically related content ("connection errors", "Kubernetes", "rate limiting"), but miss the specific match. BM25 won't.

This is why production RAG systems almost always use hybrid retrieval — combining BM25's lexical precision with embedding models' semantic understanding. Neither alone is sufficient.

What makes BM25 so durable?

  • No training required — works out of the box on any corpus, any language
  • Interpretable — you can explain exactly why a document ranked high
  • Fast — scales to billions of documents with inverted indexes
  • Battle-tested — 30+ years of research and optimization

The algorithm itself is simple. It builds on intuitions that humans share about relevance: rare words matter more than common ones, repetition helps but with limits, and longer documents shouldn't win just because they're longer.

The Three Ideas Behind BM25

BM25 combines three intuitions about what makes a document relevant to a query. Each addresses a different problem with naive keyword matching:

  1. Rare words are stronger signals — if "the" matches, who cares? If "ECONNREFUSED" matches, pay attention.
  2. Repetition has diminishing returns — mentioning a term 10 times isn't 10x better than once.
  3. Document length shouldn't dominate — a 10-page doc matching once isn't better than a 1-paragraph doc matching once.

Let's look at each in detail.

1. Rare words matter more (IDF)

If a search term appears in every document, it tells you nothing. If it appears in only one document, that's a strong signal.

BM25 weights each query term by its inverse document frequency — rare terms get high weights, common terms get low weights.

Term | Appears in... | IDF Weight
"the" | 100% of docs | ~0 (ignored)
"api" | 60% of docs | 0.51
"authentication" | 20% of docs | 1.61
"oauth" | 1 doc only | 2.30

[Interactive chart: IDF Weight vs Document Frequency]

The curve shows the inverse relationship clearly: terms appearing in just 1-2 documents get high weights (strong signal), while terms appearing in most documents get low weights (weak signal).
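
To make the table concrete, here is a minimal sketch of where those weights come from, assuming a hypothetical 10-document corpus. The table uses the plain ln(N/df) form; most production BM25 implementations (Lucene, and therefore Elasticsearch/OpenSearch) use a smoothed variant, so exact numbers differ slightly but the shape is the same.

import math

def idf_simple(n_docs, doc_freq):
    # Plain inverse document frequency: ln(total docs / docs containing the term).
    # Reproduces the table above, e.g. 60% of 10 docs -> ln(10/6) ≈ 0.51.
    return math.log(n_docs / doc_freq)

def idf_bm25(n_docs, doc_freq):
    # Smoothed IDF used by Lucene-style BM25; the 0.5 terms keep weights
    # finite and non-negative even for terms that appear in nearly every doc.
    return math.log((n_docs - doc_freq + 0.5) / (doc_freq + 0.5) + 1)

if __name__ == "__main__":
    N = 10  # hypothetical corpus size
    for term, df in [("the", 10), ("api", 6), ("authentication", 2), ("oauth", 1)]:
        print(f"{term:16s} df={df:2d}  simple={idf_simple(N, df):.2f}  smoothed={idf_bm25(N, df):.2f}")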

2. More occurrences help, but with limits (TF Saturation)

A document mentioning "authentication" twice is probably more relevant than one mentioning it once. But ten times? That's not 10x more relevant — it's probably just a longer document.

BM25 uses a saturation curve so the first few occurrences of a term boost the score significantly, but additional occurrences contribute less and less.

[Interactive chart: Term Frequency Saturation (varying k₁)]

The k₁ parameter controls how fast the curve saturates. Higher values (1.5–2.0) allow more credit for repeated terms — useful for long documents where repetition matters. Lower values (0.5–1.0) saturate faster, treating presence as more important than frequency — useful when you care more about "does it contain this term?" than "how many times?" The default k₁ = 1.2 balances these concerns.
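
A quick way to see saturation is to plug raw term counts into BM25's TF component directly. The sketch below sets b = 0 to switch off length normalization and isolate the k₁ effect; the function name is just for illustration.

def tf_saturation(tf, k1=1.2):
    # BM25's term-frequency component with length normalization off (b = 0):
    # tf * (k1 + 1) / (tf + k1). As tf grows, this approaches k1 + 1.
    return tf * (k1 + 1) / (tf + k1)

if __name__ == "__main__":
    for k1 in (0.5, 1.2, 2.0):
        row = ", ".join(f"{tf_saturation(tf, k1):.2f}" for tf in (1, 2, 5, 10, 50))
        print(f"k1={k1}: tf=1,2,5,10,50 -> {row}")
    # With k1=1.2 the values are 1.00, 1.38, 1.77, 1.96, 2.15:
    # the 50th occurrence adds almost nothing over the 10th.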

3. Long documents shouldn't win by default (Length Normalization)

Longer documents contain more words, so they naturally match more query terms. Without adjustment, a 10-page document would almost always beat a focused 1-paragraph answer.

BM25 normalizes by document length — documents longer than average get a slight penalty, shorter documents get a slight boost.

How the penalty works: BM25 uses a normalization factor that appears in the denominator of the scoring formula. This means:

  • Larger factor (above 1.0) = penalty for longer docs → dividing by a larger number reduces the score
  • Smaller factor (below 1.0) = boost for shorter docs → dividing by a smaller number increases the score
  • Factor of 1.0 = no adjustment (document is average length)

[Interactive chart: Length Normalization Factor (varying b)]

The b parameter controls normalization strength. At b = 0, length has no effect — useful for corpora where documents are similar lengths (tweets, titles). At b = 1, the penalty/bonus is fully proportional to the length ratio. The default b = 0.75 provides moderate normalization. Lower it (0.3–0.5) for short, uniform documents; raise it (0.8–0.9) when document lengths vary widely.
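
To see how the penalty and boost play out numerically, here is a small sketch of the normalization factor, 1 - b + b * (|D| / avgdl), assuming an average document length of 100 words.

def length_norm_factor(doc_len, avgdl, b=0.75):
    # The factor that scales k1 in the BM25 denominator:
    # above 1.0 penalizes longer-than-average docs, below 1.0 boosts shorter ones.
    return 1 - b + b * (doc_len / avgdl)

if __name__ == "__main__":
    avgdl = 100  # assumed average document length in words
    for doc_len in (25, 100, 400):
        factors = ", ".join(f"b={b}: {length_norm_factor(doc_len, avgdl, b):.2f}"
                            for b in (0.0, 0.75, 1.0))
        print(f"doc_len={doc_len:3d} -> {factors}")
    # At b=0.75: a 25-word doc gets 0.44 (boost), a 400-word doc gets 3.25 (penalty).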

The Formula (Reference)

For completeness, here's the full BM25 formula. You don't need to memorize this — the interactive demo below shows you what each piece does.

\text{score}(D, Q) = \sum_{i=1}^{n} \text{IDF}(q_i) \cdot \frac{f(q_i, D) \cdot (k_1 + 1)}{f(q_i, D) + k_1 \cdot \left(1 - b + b \cdot \frac{|D|}{\text{avgdl}}\right)}
Symbol | Meaning
f(qᵢ, D) | How many times query term qᵢ appears in document D
IDF(qᵢ) | Inverse document frequency — rarity weight for term qᵢ
|D| | Length of document D (word count)
avgdl | Average document length across the corpus
k₁ | Saturation parameter (default: 1.2)
b | Length normalization strength (default: 0.75)

The defaults (k₁ = 1.2, b = 0.75) work well for most use cases.
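
If you prefer reading code to notation, here is a compact, self-contained scoring sketch over a tokenized corpus. It uses the Lucene-style smoothed IDF, so absolute scores won't match every library exactly, and a real system would use an inverted index rather than scanning every document.

import math
from collections import Counter

def bm25_scores(query_terms, corpus, k1=1.2, b=0.75):
    """Score every document in `corpus` (a list of token lists) against `query_terms`."""
    N = len(corpus)
    avgdl = sum(len(doc) for doc in corpus) / N
    # Document frequency: how many documents contain each query term?
    df = {t: sum(1 for doc in corpus if t in doc) for t in set(query_terms)}

    scores = []
    for doc in corpus:
        tf = Counter(doc)
        score = 0.0
        for t in query_terms:
            if df[t] == 0 or tf[t] == 0:
                continue
            idf = math.log((N - df[t] + 0.5) / (df[t] + 0.5) + 1)   # rarity weight
            norm = 1 - b + b * (len(doc) / avgdl)                   # length normalization
            score += idf * tf[t] * (k1 + 1) / (tf[t] + k1 * norm)   # saturating TF
        scores.append(score)
    return scores

if __name__ == "__main__":
    corpus = [
        "how to authenticate api requests with oauth".split(),
        "api rate limiting and error code 429".split(),
        "troubleshooting connection errors in production".split(),
    ]
    print(bm25_scores("oauth api".split(), corpus))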

Try It: BM25 in Action

Now see it work. Enter a query and watch how BM25 ranks the documents. Expand any result to see exactly how the score was calculated — which terms contributed, how IDF and TF combined, and how length normalization affected the final score.


[Interactive chart: Score Contribution by Query Term: "how to authenticate API requests"]

Rare terms like "authenticate" contribute far more per occurrence than common terms like "to" — this is why BM25 excels at finding documents with specific technical terms.

Try: "oauth" (rare term scores high) · "API" vs "API authentication" (multi-term works better) · expand a result to see the term-by-term breakdown.

BM25 in RAG Systems

In modern retrieval systems, BM25 plays one of three roles:

Role | When to use | Example
Primary retriever | Keyword matching is sufficient | Internal docs, error code lookup
Hybrid with embeddings | Need both precision and recall | Production RAG systems
Candidate generation | Fast pre-filter before reranking | High-volume search

BM25 Alone Isn't Enough

BM25 is lexical — it matches exact words, not meaning. For LLM applications, this creates real problems:

Limitation | Example | Impact
Vocabulary mismatch | User searches "LLM" but docs say "large language model" | Zero results for valid queries
No semantic understanding | "How do I log in?" won't match "Authentication Guide" | Misses relevant content
Short queries struggle | Single-word queries give little signal | Poor ranking with vague queries

Pure BM25 retrieval typically achieves 20-40% lower recall than hybrid approaches. For production RAG, you need both.

Hybrid Search: BM25 + Embeddings

The standard pattern for production RAG is hybrid retrieval — run BM25 and embedding search in parallel, then combine scores.

BM25 and embeddings fail in complementary ways:

Query type | BM25 | Embeddings | Winner
Exact error code: ECONNREFUSED | Finds exact match | Returns "connection errors" | BM25
Natural language: "how do I log in" | No match for "Authentication" | Understands intent | Embeddings
Technical + semantic: "k8s pod restart" | Matches "pod", "restart" | Matches "Kubernetes" context | Both

What BM25 contributes to hybrid:

  • Exact keyword precision — catches the queries embeddings miss
  • Speed — BM25 with inverted indexes is fast; use it to pre-filter before expensive embedding comparisons
  • Interpretability — you can explain why a result matched

Combining BM25 and Embedding Scores

Running both retrievers is straightforward — the challenge is score fusion. BM25 scores and embedding similarity scores live on different scales, so you can't just add them. Here are the three main approaches:

1. Reciprocal Rank Fusion (RRF)

The most robust method. Instead of combining raw scores, RRF combines rankings.

How it works:

  • Each retriever returns a ranked list
  • RRF assigns a score based on rank position: 1 / (k + rank)
  • Sum the RRF scores from both retrievers

def reciprocal_rank_fusion(bm25_results, embedding_results, k=60):
    """
    Combine results from BM25 and embedding search using RRF.

    Args:
        bm25_results: List of (doc_id, bm25_score) tuples, ranked
        embedding_results: List of (doc_id, similarity_score) tuples, ranked
        k: Constant to prevent over-weighting top ranks (default: 60)

    Returns:
        List of (doc_id, rrf_score) tuples, sorted by RRF score
    """
    rrf_scores = {}

    # Score from BM25 rankings (enumerate is 0-based, so rank + 1 is the 1-based rank)
    for rank, (doc_id, _) in enumerate(bm25_results):
        rrf_scores[doc_id] = rrf_scores.get(doc_id, 0) + 1 / (k + rank + 1)

    # Score from embedding rankings
    for rank, (doc_id, _) in enumerate(embedding_results):
        rrf_scores[doc_id] = rrf_scores.get(doc_id, 0) + 1 / (k + rank + 1)

    # Sort by combined RRF score
    return sorted(rrf_scores.items(), key=lambda x: x[1], reverse=True)

When to use: RRF is the safest default. It's scale-independent, requires no tuning, and handles cases where one retriever returns nothing. Most production systems start here.

2. Weighted Score Combination

Normalize both scores to [0, 1], then combine with weights.

def weighted_score_fusion(bm25_results, embedding_results, alpha=0.5):
    """
    Combine normalized scores with a weighted average.

    Args:
        bm25_results: List of (doc_id, bm25_score) tuples
        embedding_results: List of (doc_id, similarity_score) tuples
        alpha: Weight for BM25 (0=embeddings only, 1=BM25 only)

    Returns:
        List of (doc_id, combined_score) tuples, sorted
    """
    def min_max_normalize(score_dict):
        # Map raw scores onto [0, 1] so the two retrievers are comparable
        if not score_dict:
            return {}
        max_s = max(score_dict.values())
        min_s = min(score_dict.values())
        spread = max_s - min_s if max_s != min_s else 1
        return {doc_id: (score - min_s) / spread for doc_id, score in score_dict.items()}

    # Normalize BM25 scores (unbounded) to [0, 1]
    bm25_dict = min_max_normalize(dict(bm25_results))

    # Normalize embedding scores (cosine similarity, typically in [-1, 1]) to [0, 1]
    emb_dict = min_max_normalize(dict(embedding_results))

    # Combine scores; a document missing from one retriever contributes 0 there
    all_docs = set(bm25_dict.keys()) | set(emb_dict.keys())
    combined = {
        doc_id: alpha * bm25_dict.get(doc_id, 0) + (1 - alpha) * emb_dict.get(doc_id, 0)
        for doc_id in all_docs
    }

    return sorted(combined.items(), key=lambda x: x[1], reverse=True)

When to use: When you have strong prior knowledge about your query distribution. For example:

  • α = 0.7–0.8 — technical documentation with lots of error codes, API references
  • α = 0.3–0.4 — conceptual content where semantic matching dominates
  • α = 0.5 — balanced (start here if unsure)

Tuning α: Use a validation set of real queries. Track precision@k for different α values and pick the best performer.
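
A tuning loop can be as simple as the sketch below, which reuses weighted_score_fusion from above. It assumes you have precomputed BM25 and embedding results plus the known-relevant document IDs for each validation query; the data layout here is illustrative, not a fixed API.

def precision_at_k(ranked_doc_ids, relevant_ids, k=5):
    # Fraction of the top-k results that are actually relevant.
    return sum(1 for doc_id in ranked_doc_ids[:k] if doc_id in relevant_ids) / k

def tune_alpha(validation_set, alphas=(0.3, 0.4, 0.5, 0.6, 0.7, 0.8)):
    # validation_set: list of dicts like
    #   {"bm25": [(doc_id, score), ...], "emb": [(doc_id, score), ...], "relevant": {doc_ids}}
    best_alpha, best_p = None, -1.0
    for alpha in alphas:
        mean_p = sum(
            precision_at_k(
                [doc_id for doc_id, _ in weighted_score_fusion(q["bm25"], q["emb"], alpha=alpha)],
                q["relevant"],
            )
            for q in validation_set
        ) / len(validation_set)
        if mean_p > best_p:
            best_alpha, best_p = alpha, mean_p
    return best_alpha, best_p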

3. Separate Retrieval + Cross-Encoder Reranking

Retrieve candidates from both sources, pool them, then rerank with a cross-encoder.

def retrieve_and_rerank(query, bm25_retriever, embedding_retriever,
                        reranker, top_k=10, candidate_k=50):
    """
    Retrieve candidates from both sources, then rerank.

    Args:
        query: User query string
        bm25_retriever: BM25 retriever instance
        embedding_retriever: Embedding retriever instance
        reranker: Cross-encoder model for reranking
        top_k: Final number of results to return
        candidate_k: Number of candidates to retrieve from each source

    Returns:
        Top-k documents after reranking
    """
    # Get candidates from both retrievers
    bm25_candidates = bm25_retriever.retrieve(query, top_k=candidate_k)
    emb_candidates = embedding_retriever.retrieve(query, top_k=candidate_k)

    # Pool and deduplicate
    all_candidates = {doc.id: doc for doc in bm25_candidates + emb_candidates}

    # Rerank with cross-encoder
    rerank_scores = reranker.predict(
        [(query, doc.text) for doc in all_candidates.values()]
    )

    # Sort by reranker scores
    ranked = sorted(
        zip(all_candidates.values(), rerank_scores),
        key=lambda x: x[1],
        reverse=True
    )

    return [doc for doc, score in ranked[:top_k]]

When to use: When you need maximum accuracy and can afford the latency (cross-encoders are slow). Common pattern: use BM25+embeddings for fast candidate generation (top 50-100), then rerank down to top 10.

Tuning BM25 Parameters in Hybrid Systems

The default parameters (k₁ = 1.2, b = 0.75) work well for solo BM25, but hybrid systems may benefit from adjustments:

Scenario | Recommended k₁ | Recommended b | Reasoning
Short documents (tweets, titles) | 0.5–1.0 | 0.0–0.3 | Presence matters more than frequency; minimal length variance
Mixed lengths (articles, docs) | 1.2–1.5 | 0.7–0.8 | Default works well; slight increase if embeddings handle semantics
Technical content (code, errors) | 1.5–2.0 | 0.5–0.7 | Repetition signals importance; moderate normalization
Long documents (papers, manuals) | 1.0–1.2 | 0.8–0.9 | Avoid over-crediting repetition; strong length penalty

In hybrid setups specifically:

  • Lower k₁ (0.8–1.0) if embeddings already capture semantic importance — BM25 becomes a "presence checker"
  • Lower b (0.3–0.5) if the embedding retriever already handles length bias
  • Keep defaults if using RRF, which is less sensitive to raw score magnitudes

Empirical tuning:

  1. Create a validation set of 50–100 queries with known relevant documents
  2. Grid search over k₁ ∈ {0.5, 1.0, 1.2, 1.5, 2.0} and b ∈ {0.3, 0.5, 0.75, 0.9}
  3. Measure recall@10 or MRR (mean reciprocal rank) for the hybrid system
  4. Pick the combination that maximizes your metric
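
As a concrete example, here is roughly what that grid search might look like with the rank_bm25 package (one common pure-Python BM25 implementation; the package choice and the whitespace tokenizer are assumptions). For simplicity it measures recall for BM25 alone; feed the resulting scores into your fusion step to evaluate the hybrid system end to end.

from rank_bm25 import BM25Okapi  # assumed dependency: pip install rank-bm25

def tune_bm25(corpus_texts, queries, relevant_ids,
              k1_grid=(0.5, 1.0, 1.2, 1.5, 2.0), b_grid=(0.3, 0.5, 0.75, 0.9), top_k=10):
    """
    corpus_texts: list of document strings (list index doubles as doc id)
    queries: list of query strings
    relevant_ids: list of sets of relevant doc indices, aligned with queries
    Returns (best_k1, best_b, best_recall) by mean recall@top_k.
    """
    tokenized = [text.lower().split() for text in corpus_texts]  # naive whitespace tokenizer
    best = (None, None, -1.0)
    for k1 in k1_grid:
        for b in b_grid:
            bm25 = BM25Okapi(tokenized, k1=k1, b=b)
            recall_sum = 0.0
            for query, relevant in zip(queries, relevant_ids):
                scores = bm25.get_scores(query.lower().split())
                ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
                hits = sum(1 for i in ranked[:top_k] if i in relevant)
                recall_sum += hits / max(len(relevant), 1)
            mean_recall = recall_sum / len(queries)
            if mean_recall > best[2]:
                best = (k1, b, mean_recall)
    return best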

Tools supporting hybrid search:

  • Elasticsearch/OpenSearch — BM25 by default, add kNN for vectors
  • Pinecone — sparse (BM25-style) + dense vectors in one query
  • Weaviate — hybrid search with configurable alpha between BM25 and vectors

Most RAG frameworks (LangChain, LlamaIndex) provide ensemble retrievers that handle score fusion automatically.

Summary

BM25 ranks documents by combining three factors: IDF (rare terms matter more), TF saturation (diminishing returns for repetition), and length normalization (fair comparison across document sizes).

It's fast, interpretable, and requires no training — but it only matches exact words. For production RAG, use hybrid search: BM25 for keyword precision, embeddings for semantic understanding. The combination catches queries that either approach alone would miss.

