BM25 Search in LLM Systems
A comprehensive, interactive guide to understanding the BM25 ranking algorithm for AI engineers building RAG systems and search applications.
BM25 has been the backbone of text search and retrieval for over 30 years. It's the default ranking algorithm in Elasticsearch, OpenSearch, Lucene, and most production search systems. When you search documentation, e-commerce catalogs, or internal knowledge bases, BM25 is usually what's ranking the results.
In the age of embeddings and LLMs, you might expect BM25 to be obsolete. It's not. In fact, it's become more important.
What Kind of Search is BM25?
BM25 is a lexical (or keyword-based) search algorithm. It works by matching the exact words in your query against the exact words in documents. No machine learning, no neural networks, no understanding of meaning. Just word matching with smart statistics.
This puts it in contrast with semantic search, which uses embedding models to match based on meaning. Semantic search can understand that "authentication" and "log in" are related concepts; BM25 cannot. But BM25 can find the exact document containing ECONNREFUSED when semantic search returns vaguely related "connection error" results.
| Approach | How it works | Strengths | Weaknesses |
|---|---|---|---|
| Lexical (BM25) | Matches exact words | Precise keyword matching, fast, interpretable | Misses synonyms and paraphrases |
| Semantic (Embeddings) | Matches meaning via vectors | Understands related concepts | Loses exact match precision |
Neither approach is strictly better; they complement each other. This is why most production systems use both.
Why BM25 Still Matters
Embedding models are powerful, but they have a fundamental weakness: they compress meaning into fixed-dimension vectors. That compression loses information, especially for exact matches.
When a user searches for ECONNREFUSED, k8s, or error code 429, they need the document containing that exact string. Embeddings might return semantically related content ("connection errors", "Kubernetes", "rate limiting"), but miss the specific match. BM25 won't.
This is why production RAG systems almost always use hybrid retrieval - combining BM25's lexical precision with embedding models' semantic understanding. Neither alone is sufficient.
What makes BM25 so durable?
- No training required - works out of the box on any corpus, any language
- Interpretable - you can explain exactly why a document ranked high
- Fast - scales to billions of documents with inverted indexes
- Battle-tested - 30+ years of research and optimization
The algorithm itself is simple. It builds on intuitions that humans share about relevance: rare words matter more than common ones, repetition helps but with limits, and longer documents shouldn't win just because they're longer.
The Three Ideas Behind BM25
BM25 combines three intuitions about what makes a document relevant to a query. Each addresses a different problem with naive keyword matching:
- Rare words are stronger signals - if "the" matches, who cares? If "ECONNREFUSED" matches, pay attention.
- Repetition has diminishing returns - mentioning a term 10 times isn't 10x better than once.
- Document length shouldn't dominate - a 10-page doc matching once isn't better than a 1-paragraph doc matching once.
Let's look at each in detail.
1. Rare words matter more (IDF)
If a search term appears in every document, it tells you nothing. If it appears in only one document, that's a strong signal.
BM25 weights each query term by its inverse document frequency - rare terms get high weights, common terms get low weights.
| Term | Appears in... | IDF Weight |
|---|---|---|
| "the" | 100% of docs | ~0 (ignored) |
| "api" | 60% of docs | 0.51 |
| "authentication" | 20% of docs | 1.61 |
| "oauth" | 1 doc only | 2.30 |
IDF Weight vs Document Frequency
The curve shows the inverse relationship clearly: terms appearing in just 1-2 documents get high weights (strong signal), while terms appearing in most documents get low weights (weak signal).
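The table above can be reproduced approximately with a small sketch. This uses the Lucene-style BM25 IDF (log of one plus the odds ratio, which keeps weights non-negative) and assumes an illustrative corpus of 100 documents; exact values vary slightly with the IDF variant and corpus size.

```python
import math

def bm25_idf(doc_freq: int, num_docs: int) -> float:
    # Lucene-style BM25 IDF: the +1 inside the log keeps weights non-negative
    return math.log(1 + (num_docs - doc_freq + 0.5) / (doc_freq + 0.5))

# Illustrative corpus of 100 documents
for term, df in [("the", 100), ("api", 60), ("authentication", 20), ("oauth", 1)]:
    print(f"{term:15s} df={df:3d}  idf={bm25_idf(df, 100):.2f}")
```

A term in every document scores near zero; a term in a single document scores highest.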
2. More occurrences help, but with limits (TF Saturation)
A document mentioning "authentication" twice is probably more relevant than one mentioning it once. But ten times? That's not 10x more relevant. It's probably just a longer document.
BM25 uses a saturation curve so the first few occurrences of a term boost the score significantly, but additional occurrences contribute less and less.
Term Frequency Saturation (varying k₁)
The k₁ parameter controls how fast the curve saturates. Higher values (1.5-2.0) allow more credit for repeated terms, which is useful for long documents where repetition matters. Lower values (0.5-1.0) saturate faster, treating presence as more important than frequency, which is useful when you care more about "does it contain this term?" than "how many times?" The default (k₁ = 1.2) balances these concerns.
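The saturation term can be sketched in isolation (length normalization omitted). With the default k₁ = 1.2, the contribution never exceeds k₁ + 1 = 2.2, and most of the gain comes from the first few occurrences:

```python
def tf_saturation(tf: int, k1: float = 1.2) -> float:
    # BM25's saturating term-frequency component, ignoring length normalization
    return tf * (k1 + 1) / (tf + k1)

for tf in (1, 2, 5, 10, 100):
    print(f"tf={tf:3d}  contribution={tf_saturation(tf):.2f}")
```

Going from 1 to 2 occurrences adds about 0.38; going from 10 to 100 adds only about 0.21, despite 90 extra occurrences.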
3. Long documents shouldn't win by default (Length Normalization)
Longer documents contain more words, so they naturally match more query terms. Without adjustment, a 10-page document would almost always beat a focused 1-paragraph answer.
BM25 normalizes by document length: documents longer than average get a slight penalty, shorter documents get a slight boost.
How the penalty works: BM25 uses a normalization factor that appears in the denominator of the scoring formula. This means:
- Larger factor (above 1.0) = penalty for longer docs → dividing by a larger number reduces the score
- Smaller factor (below 1.0) = boost for shorter docs → dividing by a smaller number increases the score
- Factor of 1.0 = no adjustment (document is average length)
Length Normalization Factor (varying b)
The b parameter controls normalization strength. At b = 0, length has no effect, which is useful for corpora where documents are similar lengths (tweets, titles). At b = 1, the penalty/bonus is fully proportional to the length ratio. The default (b = 0.75) provides moderate normalization. Lower it (0.3-0.5) for short, uniform documents; raise it (0.8-0.9) when document lengths vary widely.
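The normalization factor described above is 1 − b + b · |D| / avgdl. A quick sketch, assuming an average document length of 100 words:

```python
def length_norm(doc_len: int, avg_len: float, b: float = 0.75) -> float:
    # BM25 length-normalization factor: >1 penalizes long docs, <1 boosts short ones
    return 1 - b + b * (doc_len / avg_len)

avg = 100.0
for doc_len in (50, 100, 200):
    print(f"len={doc_len:3d}  factor={length_norm(doc_len, avg):.3f}")
```

A 200-word document in this corpus gets a factor of 1.75 (a penalty), a 50-word document gets 0.625 (a boost), and setting b = 0 pins the factor at 1.0 regardless of length.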
The Formula (Reference)
For completeness, here's the full BM25 formula. You don't need to memorize this. The interactive demo below shows you what each piece does.

score(D, Q) = Σᵢ IDF(qᵢ) · f(qᵢ, D) · (k₁ + 1) / (f(qᵢ, D) + k₁ · (1 − b + b · |D| / avgdl))
| Symbol | Meaning |
|---|---|
| f(qᵢ, D) | How many times query term qᵢ appears in document D |
| IDF(qᵢ) | Inverse document frequency - rarity weight for term qᵢ |
| \|D\| | Length of document D (word count) |
| avgdl | Average document length across the corpus |
| k₁ | Saturation parameter (default: 1.2) |
| b | Length normalization strength (default: 0.75) |
The defaults (k₁ = 1.2, b = 0.75) work well for most use cases.
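Putting the three pieces together, here is a minimal, self-contained BM25 scorer. This is a sketch for illustration, not a production implementation: real systems use inverted indexes and proper analyzers, and this version assumes pre-tokenized documents.

```python
import math
from collections import Counter

def bm25_score(query, doc, corpus, k1=1.2, b=0.75):
    """Score one tokenized document against a tokenized query.
    `corpus` is the full list of tokenized documents (needed for IDF and avgdl)."""
    n = len(corpus)
    avgdl = sum(len(d) for d in corpus) / n
    tf = Counter(doc)
    score = 0.0
    for term in set(query):
        df = sum(1 for d in corpus if term in d)
        if df == 0:
            continue  # term absent from the corpus contributes nothing
        idf = math.log(1 + (n - df + 0.5) / (df + 0.5))
        norm = k1 * (1 - b + b * len(doc) / avgdl)
        score += idf * tf[term] * (k1 + 1) / (tf[term] + norm)
    return score

docs = [
    "how to authenticate api requests with oauth".split(),
    "api reference for the payments endpoint".split(),
    "company holiday party photos".split(),
]
query = "oauth api".split()
ranked = sorted(docs, key=lambda d: bm25_score(query, d, docs), reverse=True)
```

The first document matches both query terms and ranks highest; the last matches neither and scores zero, which is exactly the lexical behavior described above.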
Try It: BM25 in Action
Now see it work. Enter a query and watch how BM25 ranks the documents. Expand any result to see exactly how the score was calculated: which terms contributed, how IDF and TF combined, and how length normalization affected the final score.
Score Contribution by Query Term: "how to authenticate API requests"
Rare terms like "authenticate" contribute more per occurrence than common terms like "to". This is why BM25 excels at finding documents with specific technical terms.
Try: "oauth" (rare term scores high) · "API" vs "API authentication" (multi-term works better) · expand a result to see term-by-term breakdown · scroll down to see how each term contributes to the score
BM25 in RAG Systems
In modern retrieval systems, BM25 plays one of three roles:
| Role | When to use | Example |
|---|---|---|
| Primary retriever | Keyword matching is sufficient | Internal docs, error code lookup |
| Hybrid with embeddings | Need both precision and recall | Production RAG systems |
| Candidate generation | Fast pre-filter before reranking | High-volume search |
BM25 Alone Isn't Enough
BM25 is lexical: it matches exact words, not meaning. For LLM applications, this creates real problems:
| Limitation | Example | Impact |
|---|---|---|
| Vocabulary mismatch | User searches "LLM" but docs say "large language model" | Zero results for valid queries |
| No semantic understanding | "How do I log in?" won't match "Authentication Guide" | Misses relevant content |
| Short queries struggle | Single-word queries give little signal | Poor ranking with vague queries |
Pure BM25 retrieval typically achieves 20-40% lower recall than hybrid approaches. For production RAG, you need both.
Hybrid Search: BM25 + Embeddings
The standard pattern for production RAG is hybrid retrieval - run BM25 and embedding search in parallel, then combine scores.
BM25 and embeddings fail in complementary ways:
| Query type | BM25 | Embeddings | Winner |
|---|---|---|---|
| Exact error code: ECONNREFUSED | Finds exact match | Returns "connection errors" | BM25 |
| Natural language: "how do I log in" | No match for "Authentication" | Understands intent | Embeddings |
| Technical + semantic: "k8s pod restart" | Matches "pod", "restart" | Matches "Kubernetes" context | Both |
What BM25 contributes to hybrid:
- Exact keyword precision - catches the queries embeddings miss
- Speed - BM25 with inverted indexes is fast; use it to pre-filter before expensive embedding comparisons
- Interpretability - you can explain why a result matched
Combining BM25 and Embedding Scores
Running both retrievers is straightforward. The challenge is score fusion. BM25 scores and embedding similarity scores live on different scales, so you can't just add them. Here are the three main approaches:
1. Reciprocal Rank Fusion (RRF)
The most robust method. Instead of combining raw scores, RRF combines rankings.
How it works:
- Each retriever returns a ranked list
- RRF assigns each document a score of 1 / (k + rank) for its position in each list, where k is a smoothing constant (typically 60)
- Sum the RRF scores from both retrievers
When to use: RRF is the safest default. It's scale-independent, requires no tuning, and handles cases where one retriever returns nothing. Most production systems start here.
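The three steps above fit in a few lines. This sketch assumes each retriever returns an ordered list of document IDs:

```python
def rrf_fuse(rankings, k=60):
    # Each document scores 1/(k + rank) per list it appears in (rank is 1-based);
    # k=60 is the conventional constant
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_ranked = ["doc_a", "doc_b", "doc_c"]   # from the lexical retriever
embed_ranked = ["doc_b", "doc_d", "doc_a"]  # from the embedding retriever
fused = rrf_fuse([bm25_ranked, embed_ranked])
```

Here doc_b (ranked 2nd and 1st) edges out doc_a (1st and 3rd), and doc_c and doc_d, each found by only one retriever, still make the fused list.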
2. Weighted Score Combination
Normalize both scores to [0, 1], then combine with weights: final = α · BM25_norm + (1 − α) · embedding_norm.
When to use: When you have strong prior knowledge about your query distribution. For example:
- α = 0.7-0.8 - technical documentation with lots of error codes, API references
- α = 0.3-0.4 - conceptual content where semantic matching dominates
- α = 0.5 - balanced (start here if unsure)
Tuning α: Use a validation set of real queries. Track precision@k for different α values and pick the best performer.
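A sketch of weighted fusion with min-max normalization. Documents found by only one retriever are assumed to score 0.0 on the other side:

```python
def minmax(scores):
    # Rescale a {doc_id: score} dict to [0, 1]
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}

def weighted_fuse(bm25_scores, embed_scores, alpha=0.5):
    # alpha weights the lexical side, (1 - alpha) the semantic side
    b, e = minmax(bm25_scores), minmax(embed_scores)
    combined = {doc_id: alpha * b.get(doc_id, 0.0) + (1 - alpha) * e.get(doc_id, 0.0)
                for doc_id in set(b) | set(e)}
    return sorted(combined, key=combined.get, reverse=True)

bm25_scores = {"doc_a": 12.3, "doc_b": 8.1}    # raw BM25 scores
embed_scores = {"doc_b": 0.91, "doc_c": 0.88}  # cosine similarities
ranked = weighted_fuse(bm25_scores, embed_scores, alpha=0.7)
```

With α = 0.7 the BM25 winner (doc_a) leads; drop α toward 0 and the embedding winner (doc_b) takes over, which is the lever the tuning advice above is adjusting.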
3. Separate Retrieval + Cross-Encoder Reranking
Retrieve candidates from both sources, pool them, then rerank with a cross-encoder.
When to use: When you need maximum accuracy and can afford the latency (cross-encoders are slow). Common pattern: use BM25+embeddings for fast candidate generation (top 50-100), then rerank down to top 10.
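The pool-then-rerank pattern in outline. The cross-encoder call is stubbed out with a trivial word-overlap scorer so the sketch is self-contained; in practice you would swap in a real model call:

```python
def pool_and_rerank(bm25_hits, embed_hits, score_fn, query, top_k=10):
    # Deduplicate the union of both candidate lists, preserving first-seen order,
    # then rerank the pool with the (expensive) scoring function
    pool = list(dict.fromkeys(bm25_hits + embed_hits))
    return sorted(pool, key=lambda doc: score_fn(query, doc), reverse=True)[:top_k]

def overlap_score(query, doc):
    # Placeholder for a cross-encoder: counts shared words
    return len(set(query.lower().split()) & set(doc.lower().split()))

candidates_bm25 = ["pod restart loop fix", "kubectl cheat sheet"]
candidates_embed = ["pod restart loop fix", "kubernetes pod lifecycle"]
top = pool_and_rerank(candidates_bm25, candidates_embed, overlap_score,
                      "pod restart", top_k=2)
```

The duplicate candidate is scored only once, and only the pooled top-k survives, which is what keeps the expensive reranking step affordable.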
Tuning BM25 Parameters in Hybrid Systems
The default parameters (k₁ = 1.2, b = 0.75) work well for solo BM25, but hybrid systems may benefit from adjustments:
| Scenario | Recommended k₁ | Recommended b | Reasoning |
|---|---|---|---|
| Short documents (tweets, titles) | 0.5-1.0 | 0.0-0.3 | Presence matters more than frequency; minimal length variance |
| Mixed lengths (articles, docs) | 1.2-1.5 | 0.7-0.8 | Default works well; slight increase if embeddings handle semantics |
| Technical content (code, errors) | 1.5-2.0 | 0.5-0.7 | Repetition signals importance; moderate normalization |
| Long documents (papers, manuals) | 1.0-1.2 | 0.8-0.9 | Avoid over-crediting repetition; strong length penalty |
In hybrid setups specifically:
- Lower k₁ (0.8-1.0) if embeddings already capture semantic importance, since BM25 becomes a "presence checker"
- Lower b (0.3-0.5) if the embedding retriever already handles length bias
- Keep defaults if using RRF, which is less sensitive to raw score magnitudes
Empirical tuning:
- Create a validation set of 50-100 queries with known relevant documents
- Grid search over k₁ and b
- Measure recall@10 or MRR (mean reciprocal rank) for the hybrid system
- Pick the combination that maximizes your metric
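The tuning loop above can be sketched as a grid search. Here `retrieve` is a placeholder for your hybrid retriever, assumed to be parameterized by k₁ and b and to return the top-10 document IDs:

```python
from itertools import product

def grid_search(validation, retrieve,
                k1_grid=(1.0, 1.2, 1.5, 2.0), b_grid=(0.3, 0.5, 0.75, 0.9)):
    """Pick the (k1, b) pair maximizing mean recall@10.
    `validation` is a list of (query, set_of_relevant_doc_ids) pairs."""
    def mean_recall(k1, b):
        recalls = [len(set(retrieve(q, k1, b)) & relevant) / len(relevant)
                   for q, relevant in validation]
        return sum(recalls) / len(recalls)
    return max(product(k1_grid, b_grid), key=lambda params: mean_recall(*params))
```

Swapping mean_recall for MRR is a one-function change; the outer loop is the same either way.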
Tools supporting hybrid search:
- Elasticsearch/OpenSearch - BM25 by default, add kNN for vectors
- Pinecone - sparse (BM25-style) + dense vectors in one query
- Weaviate - hybrid search with configurable alpha between BM25 and vectors
Most RAG frameworks (LangChain, LlamaIndex) provide ensemble retrievers that handle score fusion automatically.
Summary
BM25 ranks documents by combining three factors: IDF (rare terms matter more), TF saturation (diminishing returns for repetition), and length normalization (fair comparison across document sizes).
It's fast, interpretable, and requires no training, but it only matches exact words. For production RAG, use hybrid search: BM25 for keyword precision, embeddings for semantic understanding. The combination catches queries that either approach alone would miss.
Cite this post
Cole Hoffer. (Dec 2025). BM25 Search in LLM Systems. Cole Hoffer. https://www.colehoffer.ai/guides/bm25-search-in-llm-systems