BM25

BM25 (Best Matching 25) is a probabilistic ranking function used by classical search engines and the lexical layer of modern hybrid retrieval systems. It is the standard mechanism for scoring exact-keyword match in search retrieval; its application inside specific commercial AI search engines is not vendor-documented but is consistent with observable lexical-signal behavior.

Last checked 2026-06-22

What is BM25?

BM25 (Best Matching 25) is a probabilistic ranking function [1] introduced in 1994 by Stephen Robertson and Steve Walker [2] in the Okapi information retrieval system at City University London (with foundational contributions from Karen Spärck Jones). It scores documents against a query based on three signals: term frequency (how often the document mentions the query terms, with diminishing returns past a small number of occurrences), inverse document frequency (how rare those terms are across the corpus), and document length normalization (matches count for less in documents longer than the corpus average).

The formula has two tunable parameters: k1 (controls term-frequency saturation) and b (controls length normalization strength). Standard defaults are k1 in the 1.2-2.0 range (1.2 is the most cited single value) and b=0.75; engines tune these per-corpus.
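As a minimal sketch of how the three signals and the two parameters combine, the Python below scores tokenized documents against a query. The function name bm25_scores, the whitespace tokenization, and the toy corpus are illustrative assumptions. The IDF variant shown adds 1 inside the logarithm so it never goes negative, as Lucene's implementation does. Other implementations differ in such details.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Return one BM25 score per tokenized document (assumes a non-empty corpus)."""
    n_docs = len(docs)
    avg_len = sum(len(doc) for doc in docs) / n_docs

    # Document frequency: in how many documents each term appears.
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))

    scores = []
    for doc in docs:
        term_counts = Counter(doc)
        # Length normalization: longer-than-average documents get a larger value.
        length_norm = k1 * (1 - b + b * len(doc) / avg_len)
        score = 0.0
        for term in query:
            tf = term_counts[term]
            if tf == 0:
                continue
            # Inverse document frequency: rare terms weigh more.
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Saturating term frequency: each extra occurrence adds less.
            score += idf * tf * (k1 + 1) / (tf + length_norm)
        scores.append(score)
    return scores


docs = [
    "bm25 is a ranking function".split(),
    "vector embeddings capture semantic similarity".split(),
    "bm25 bm25 scoring with length normalization".split(),
]
scores = bm25_scores(["bm25"], docs)
print(scores)
```

In this toy corpus the second document never mentions the query term and scores zero, while the third scores highest because its two occurrences outweigh its slightly greater length.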

Status in 2026

Still production-standard despite being 30+ years old. BM25 is the documented lexical layer of every major open-source and commercial search infrastructure platform whose retrieval architecture is published: Elasticsearch (default), OpenSearch (default), Solr (default), Azure AI Search (default), and Lucene (the underlying library that several of those build on).

Commercial AI search engines (Perplexity, Microsoft Copilot, Claude search, Google AI Mode) have not officially documented their retrieval architectures, but observable behavior on lexical-signal queries (acronyms, proper nouns, exact phrases) is consistent with hybrid retrieval that includes BM25 or BM25-style lexical scoring in the stack. The 2024-2026 AI search wave reinforced BM25's relevance: pure-vector retrieval tends to underperform on queries with strong lexical signals, as documented in the BEIR benchmark [3] (Thakur et al. 2021), which evaluates retrieval methods across diverse tasks and shows BM25 remains competitive or superior on many of them. Hybrid systems treat BM25 as a permanent floor for these query types.
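On platforms that expose the parameters, k1 and b are usually set in index configuration rather than in application code. The sketch below builds an Elasticsearch-style index definition as a Python dictionary, following the custom similarity settings described in Elasticsearch's documentation. The similarity name tuned_bm25 and the field name body are hypothetical, and the values are simply the common defaults, not a recommendation.

```python
import json

# Hypothetical names: "tuned_bm25" (custom similarity) and "body" (text field).
index_definition = {
    "settings": {
        "index": {
            "similarity": {
                # k1 controls term-frequency saturation; b controls length normalization.
                "tuned_bm25": {"type": "BM25", "k1": 1.2, "b": 0.75}
            }
        }
    },
    "mappings": {
        "properties": {
            # The field opts in to the custom similarity by name.
            "body": {"type": "text", "similarity": "tuned_bm25"}
        }
    },
}

# JSON body that would accompany an index-creation request.
request_body = json.dumps(index_definition, indent=2)
print(request_body)
```

Other platforms expose the same two parameters through their own configuration formats, so treat the structure above as specific to Elasticsearch-style settings.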

Note on this entry's territory: BM25 as an algorithm is vendor-canonical, documented in Robertson & Walker 1994, Robertson & Zaragoza 2009, and detailed implementation documentation from Elasticsearch, OpenSearch, Solr, Lucene, and Azure AI Search. Its application to specific commercial AI search engines is non-vendor-canonical because those engines do not publish their retrieval architectures. The content-side application (exact-term usage, concept density, focused passages) sits in practitioner-discipline territory: writers can directly measure exact-string match presence in their own content without needing vendor-confirmed retrieval mechanisms. This entry is paired with the vector embeddings entry: BM25 covers the lexical layer of hybrid retrieval (exact-string match); vector embeddings cover the semantic layer (intent match). Together they describe the two foundational components hybrid retrieval combines.

How to apply

You do not tune BM25 (engines do), but your content is the input that BM25 scores. Three writing-side levers:

  • Use the precise terms your audience uses: BM25 rewards exact-string match. If your audience searches "BM25 algorithm" and you write "the probabilistic ranking function" throughout, you lose the lexical signal even when semantic models match.
  • Concentrate the dominant concept where chunking is likely to land: at the document level, BM25 counts term frequency regardless of position, so front-loading and back-loading the same word produce the same document-level TF score. What front-loading actually helps with is chunk-level retrieval: most production RAG systems chunk documents at fixed token boundaries (~200-1024 tokens; see passage-level optimization), so concepts placed near the start of a section are (a) less likely to be split across chunk boundaries and (b) more likely to be present in the chunk that wins retrieval. "Front-loading helps BM25" is therefore a chunking effect, not a position-weighting property of BM25 itself. The hybrid retrieval entry makes the same correction for embedding models.
  • Avoid unnecessary length-padding, with nuance: BM25's length-normalization (the b parameter, typically 0.75) discounts documents longer than the corpus average for the same number of term hits. A 300-word focused page with the query term appearing 3 times typically outscores a 3000-word page with the same 3 occurrences because of the length penalty. Note: if the longer page genuinely contains 30 occurrences of the term, the higher TF can compensate for the length penalty; the lesson is concentration, not absolute brevity.
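The length-normalization point can be checked with a few lines of arithmetic. The sketch below computes only the term-frequency part of BM25 for a single term, leaving out IDF because it is identical for every page in the same corpus. The helper name tf_component and the 1000-word corpus average are assumptions chosen for illustration.

```python
def tf_component(tf, doc_len, avg_len, k1=1.2, b=0.75):
    """Term-frequency part of BM25 for one term (IDF omitted)."""
    length_norm = k1 * (1 - b + b * doc_len / avg_len)
    return tf * (k1 + 1) / (tf + length_norm)


AVG_LEN = 1000  # assumed corpus-average page length in words (hypothetical)

short_page = tf_component(tf=3, doc_len=300, avg_len=AVG_LEN)
long_page = tf_component(tf=3, doc_len=3000, avg_len=AVG_LEN)
long_dense_page = tf_component(tf=30, doc_len=3000, avg_len=AVG_LEN)

print(f"300 words, 3 occurrences:   {short_page:.2f}")       # 1.85
print(f"3000 words, 3 occurrences:  {long_page:.2f}")        # 1.10
print(f"3000 words, 30 occurrences: {long_dense_page:.2f}")  # 2.00
```

Under these assumptions the focused page beats the padded page with the same three occurrences. The long page overtakes it only when its term count grows along with its length, which is the concentration point made above.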

What to skip: keyword-stuffing. BM25 saturates term frequency (via the k1 parameter). Past a small number of occurrences, additional repetitions add little to the score and risk triggering anti-spam filters at higher layers.
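A short sketch makes the saturation visible. It assumes a page of exactly corpus-average length, in which case the length-normalization term reduces to k1 and the b parameter drops out. The function name saturated_tf is illustrative.

```python
def saturated_tf(tf, k1=1.2):
    """BM25 term-frequency factor for a page of exactly average length."""
    return tf * (k1 + 1) / (tf + k1)


for tf in (1, 2, 5, 10, 50):
    # The factor approaches, but never reaches, the ceiling k1 + 1 = 2.2.
    print(tf, round(saturated_tf(tf), 3))
# 1 1.0
# 2 1.375
# 5 1.774
# 10 1.964
# 50 2.148
```

Going from one occurrence to two adds about 0.38 to the factor, while going from ten to fifty adds only about 0.18.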

How it relates to other concepts

  • Lexical-layer component of hybrid retrieval, paired with vector embeddings in production.
  • Lexical-layer counterpart to vector embeddings: BM25 handles exact-string match; embeddings handle intent match. Together they describe the two foundational components hybrid retrieval combines.
  • A common lexical component of RAG and hybrid retrieval pipelines, though not a universal one: production RAG systems can also use pure-vector, pure-keyword, or other retrieval methods depending on the application.
  • Per-chunk scoring mechanism for sub-document retrieval in classical search backends.

Footnotes

  1. Robertson & Zaragoza. "The Probabilistic Relevance Framework: BM25 and Beyond." Foundations and Trends in Information Retrieval, 2009. The canonical retrospective on BM25's development, formula derivation, and parameter tuning. This is the standard citation for the modern formulation; for the primary source see Robertson & Walker 1994 below.

  2. Robertson & Walker. "Some Simple Effective Approximations to the 2-Poisson Model for Probabilistic Weighted Retrieval." In Proceedings of SIGIR 1994. The original paper introducing the BM25 ranking function in the Okapi information retrieval system at City University London. Robertson & Zaragoza (2009) is the canonical retrospective; this is the primary source.

  3. Thakur et al. "BEIR: A Heterogeneous Benchmark for Zero-shot Evaluation of Information Retrieval Models." arXiv:2104.08663, April 2021. A standard benchmark that evaluates retrieval methods (including BM25, dense retrievers, late-interaction models, and hybrid combinations) across diverse tasks; BM25 remains competitive or superior on many of them, supporting the "BM25 floor" framing in hybrid retrieval stacks.

FAQ

Is BM25 still relevant in the AI search era?
Yes. BM25 is the default lexical layer of every major open-source and commercial search infrastructure platform that publishes its retrieval architecture (Elasticsearch, OpenSearch, Solr, Azure AI Search, Lucene). Commercial AI search engines have not officially documented their retrieval architectures, but observable behavior on lexical-signal queries is consistent with hybrid retrieval that includes BM25-style lexical scoring. Pure semantic retrieval tends to underperform on queries with strong lexical signals (per the BEIR benchmark), so the AI search shift has made BM25 more entrenched in the production stack, not less.
Do I need to tune BM25 parameters as a content publisher?
No. BM25 has two tunable parameters (k1 and b) that engines set internally. Your levers are at the content layer: exact-term usage, concept density, focused passages.
Why does BM25 beat newer methods on some queries?
Queries with strong lexical signals (acronyms, proper nouns, exact phrases) depend on exact-string matching, which BM25 scores directly and which semantic models struggle to match. Engines run BM25 alongside embeddings precisely to keep this lexical floor.
