Chunking
Citation status
Last checked 2026-06-04
Chunking is the preprocessing step in a retrieval pipeline that splits a document into smaller segments (chunks, typically no more than a few hundred tokens each) so each can be embedded, indexed, and retrieved independently [1]. It runs before embedding and retrieval. In a RAG-style pipeline a long page is rarely retrieved whole; it is broken into chunks, and the chunk, not the page, is typically the unit a retrieval system scores and passes to the model.
Chunking strategies vary. Fixed-size windows split on a set token count, often with a small overlap between adjacent chunks so a sentence straddling a boundary is not lost. Structural splits follow paragraph, heading, or sentence boundaries. Semantic splits group sentences that are topically close. Each trades off differently between keeping a coherent idea intact and keeping chunks small enough to retrieve precisely. There is no single correct chunk size; it depends on the embedding model, the content, and the query patterns the system expects.
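As an illustration of the fixed-size strategy, here is a minimal sketch of windowing with overlap. The function name `chunk_fixed` and the whitespace "tokenizer" are assumptions made for the example; a real pipeline would count tokens with its embedding model's own tokenizer.

```python
def chunk_fixed(tokens, size, overlap=0):
    """Split a token list into windows of `size` tokens, where each window
    shares `overlap` tokens with the window before it."""
    if size <= 0:
        raise ValueError("size must be positive")
    if not 0 <= overlap < size:
        raise ValueError("overlap must be in [0, size)")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reaches the end of the document
    return chunks


# Whitespace splitting stands in for a real tokenizer (an assumption).
words = "the quick brown fox jumps over the lazy dog again".split()
for chunk in chunk_fixed(words, size=4, overlap=1):
    print(" ".join(chunk))
```

With `size=4` and `overlap=1`, the last token of each window is repeated as the first token of the next, which is how a sentence that straddles a boundary stays recoverable from at least one chunk.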
Status in 2026
Chunking is a standard, well-understood step in most production retrieval-augmented generation systems, especially those indexing long documents; the live question is not whether to chunk but how. The recognized failure mode is context loss: a chunk that reads clearly inside its original document can become uninterpretable in isolation. Anthropic's worked example is a chunk stating that revenue "grew by 3%" with no indication of which company or which quarter, which makes it both hard to retrieve for a specific query and hard for the model to use if it is retrieved [1].
Several approaches emerged in 2024-2026 to counter context loss, and they are instructive precisely because they disagree. Contextual chunking (Anthropic) prepends a short description (roughly 50-100 tokens) of the chunk's place in its document before embedding; Anthropic reported retrieval-failure reductions of about 35% from contextual embeddings alone, about 49% combined with lexical BM25, and about 67% with reranking added, though these are one vendor's results on its own corpora, not a shared benchmark or a guaranteed lift [1]. Late chunking (Jina AI) takes almost the opposite route: embed the whole document with a long-context model first, then derive each chunk's embedding from the token-level representations, preserving cross-chunk context without a per-chunk LLM call and without additional training [2]. Semantic chunking splits on topical boundaries (embedding-similarity breakpoints) rather than a fixed token count. Chunking strategy is an active area; the common thread is keeping a chunk interpretable in isolation, and none of these is a knob the publisher turns.
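The late-chunking order of operations (embed the whole document, then split, then mean-pool) can be sketched in a few lines. This is a toy illustration, not Jina AI's implementation: `token_vectors` stands in for the per-token output of a long-context embedding model, and the function names and numbers are made up for the example.

```python
def mean_pool(vectors):
    """Average a non-empty list of equal-length vectors component-wise."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]


def late_chunk_embeddings(token_vectors, spans):
    """Pool contextualized token vectors into one embedding per chunk.

    token_vectors: one vector per token, assumed to come from a single
        model pass over the whole document, so each vector already
        reflects context from outside its own chunk.
    spans: (start, end) token index pairs, end exclusive, one per chunk.
    """
    return [mean_pool(token_vectors[start:end]) for start, end in spans]


# Toy stand-in for model output: 6 tokens, 2 dimensions each.
token_vectors = [[1.0, 0.0], [3.0, 0.0], [2.0, 3.0],
                 [0.0, 4.0], [0.0, 2.0], [0.0, 0.0]]
spans = [(0, 3), (3, 6)]  # two chunks of three tokens each
print(late_chunk_embeddings(token_vectors, spans))
```

The contrast with the conventional order is where the split happens: chunk-first pipelines run the model once per chunk, so each token vector only ever sees its own chunk, whereas here the split is applied to vectors that were computed with the full document in view.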
The point that matters for publishers is control: chunk size, split boundaries, overlap, and whether context is prepended are all set by whoever operates the retrieval pipeline, not by the publisher whose content is chunked.
How to apply
The boundary between what a publisher controls and what the retrieval pipeline controls is the whole story here. You do not set the chunk size, choose the split points, or decide whether the system adds context to each chunk. What you control is whether each unit of your content still makes sense after it is cut out of the page. Three moves follow from that:
- Write self-contained passages. A paragraph that names its own subject, makes one clear claim, and does not depend on the sentence before or after it survives being isolated into a chunk. This is the same discipline behind passage-level optimization and sub-passage extraction; chunking is the pipeline step that makes this discipline pay off, because passages or chunks are often the unit retrieved.
- State the subject inside the passage, not only in the heading. Chunking frequently separates a paragraph from the heading above it, so a passage whose topic is only established by a distant <h2> arrives context-stripped. Naming the subject in the passage itself is the manual, publisher-side approximation of what contextual chunking does automatically on the retrieval operator's side.
- Do not chase a chunk size you cannot set. Because chunk size and boundaries are chosen by the pipeline, advice to "write in 300-token blocks" optimizes for a parameter you do not control and that varies by system. The durable investment is passage self-containment, which holds regardless of where the boundary falls.
What to skip:
- "Optimal chunk size for SEO" advice aimed at publishers. Chunk size is a retrieval-operator setting, not a publishing choice; the same page is chunked differently by different engines.
- Assuming the whole page is the retrieved unit. It almost never is; the chunk is. A page that only makes sense read top to bottom can retrieve poorly even when it ranks well in classic search.
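The self-containment advice above can be spot-checked mechanically. The sketch below is a rough heuristic invented for illustration, not an established tool: it flags a passage that never names its subject or that opens with a word pointing back at earlier text. The opener list, the function name, and the "Acme Corp" subject are all hypothetical.

```python
import re

# Hypothetical list of openers that usually point back at earlier text.
DANGLING_OPENERS = {"it", "this", "that", "these", "those", "they"}


def self_containment_issues(passage, subject):
    """Return reasons the passage may not make sense once isolated."""
    issues = []
    words = re.findall(r"[A-Za-z']+", passage.lower())
    if subject.lower() not in passage.lower():
        issues.append("subject not named in passage")
    if words and words[0] in DANGLING_OPENERS:
        issues.append("opens with a back-reference: " + words[0])
    return issues


# "Acme Corp" is a made-up subject for the example.
print(self_containment_issues("It grew by 3% over the previous quarter.", "Acme Corp"))
print(self_containment_issues("Acme Corp revenue grew by 3% in Q2 2023.", "Acme Corp"))
```

A check like this only catches the crudest cases; a passage can name its subject and still lean on a definition given two paragraphs earlier, so it is a prompt for editorial review rather than a pass/fail test.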
How it relates to other concepts
- Upstream of vector embeddings and RAG: chunking produces the segments that get embedded and retrieved. The chunk is the atomic unit of the whole retrieve-then-generate flow, which is why decisions made at the chunking step propagate through everything downstream.
- The pipeline reason passage-level optimization and sub-passage extraction matter: both disciplines assume the retrieved unit is a passage, not a page. Chunking is the concrete step that makes the passage the unit.
- Distinct from sub-document retrieval: the two are adjacent steps in the same neighborhood, not synonyms. Chunking is the splitting step that produces the passage-sized units; sub-document retrieval is the step that indexes and selects among those units instead of returning whole documents. Chunking makes the unit; sub-document retrieval chooses one.
- Compounds with lost in the middle: chunking decides how your content is cut, and then the assembled context decides where the surviving chunk lands relative to the model's attention. Both are pipeline-controlled stages a publisher cannot directly set; self-contained passages are the response to both.
- Contextual chunking leans on BM25 and reranking: the reported gains came from contextualized chunks scored by lexical retrieval and then reranked, a reminder that chunking is one stage in a multi-stage hybrid retrieval system rather than a standalone lever.
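To make the multi-stage point concrete, here is a sketch of one common way to merge a lexical ranking and an embedding ranking before reranking: reciprocal rank fusion. This is an assumption chosen for illustration; the text above does not say which fusion method produced the reported numbers. The chunk ids and the conventional constant `k=60` are likewise illustrative.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of chunk ids into one list.

    Each chunk scores 1 / (k + rank) for every list it appears in
    (rank starts at 1); chunks are returned by descending total score,
    with ties broken by chunk id for a deterministic order.
    """
    scores = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


# Hypothetical results for one query from two first-stage retrievers.
bm25_ranking = ["c2", "c1", "c3"]
embedding_ranking = ["c1", "c4", "c2"]
fused = reciprocal_rank_fusion([bm25_ranking, embedding_ranking])
print(fused)
```

In a pipeline of this shape, the fused list would then be passed to a reranker, which scores each surviving chunk against the query and keeps only the top few for the model.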
Footnotes
1. Anthropic. "Introducing Contextual Retrieval." September 19, 2024. anthropic.com/news/contextual-retrieval. Describes standard RAG chunking as breaking a knowledge base into segments of "no more than a few hundred tokens" before embedding, and identifies context loss as the core failure mode (the SEC-filing example: a chunk reading "the company's revenue grew by 3% over the previous quarter" with no indication of which company or which quarter). Contextual Retrieval prepends a short chunk-specific explanation (typically 50-100 tokens) before embedding and indexing. Reported failure-rate reductions on Anthropic's evaluation: about 35% with contextual embeddings alone, about 49% combined with contextual BM25, and about 67% with reranking added; preprocessing cost approximately $1.02 per million document tokens using prompt caching. These are one vendor's results on its own corpora, not a cross-vendor benchmark.
2. Günther, M., Mohr, I., Williams, D.J., Wang, B., and Xiao, H. (Jina AI). "Late Chunking: Contextual Chunk Embeddings Using Long-Context Embedding Models." arXiv:2409.04701, submitted September 7, 2024. Proposes embedding an entire long document with a long-context embedding model and applying chunking after the transformer pass (just before mean pooling), so each chunk embedding retains cross-chunk context, in contrast to the conventional chunk-first-then-embed order. The abstract states the method "works without additional training," with an optional dedicated fine-tuning approach to improve effectiveness. Available via the late_chunking parameter in jina-embeddings-v3. Distinct from Anthropic's contextual chunking, which prepends generated context text to each chunk before embedding; late chunking instead defers the split to preserve context without a per-chunk generation step.
Related terms
- Passage-level optimization (/terms/passage-level-optimization)
- Sub-passage extraction (/terms/sub-passage-extraction)
- Lost in the Middle (/terms/lost-in-the-middle)
- Vector embeddings (/terms/vector-embeddings)
- RAG (Retrieval-Augmented Generation) (/terms/rag)
- Hybrid retrieval (/terms/hybrid-retrieval)
- Reranking (/terms/reranking)
- Sub-document retrieval (/terms/sub-document-retrieval)
FAQ
- What is chunking in RAG and AI search?
- Chunking is the step that splits a document into smaller segments (chunks) before they are embedded and indexed for retrieval. In a RAG-style pipeline a long page is rarely retrieved whole; it is broken into chunks, typically no more than a few hundred tokens each, and the chunk, not the page, is usually the unit a retrieval system scores and hands to the model. Chunking happens upstream of embedding and retrieval in a retrieval-augmented generation pipeline.
- What is a typical chunk size?
- In documentation-style RAG examples chunks are often on the order of a few hundred tokens, but there is no single correct size, and in practice they range from tens to over a thousand tokens. It depends on the embedding model's input window, the content's structure, and the query patterns. Smaller chunks retrieve more precisely but lose surrounding context; larger chunks keep an idea intact but dilute the embedding and may exceed model limits. Strategies include fixed-size windows (with optional overlap between adjacent chunks), structural splits (by paragraph, heading, or sentence), and semantic splits (grouping topically close sentences). The chunk size is set by the retrieval system operator, not by the publisher whose content is being chunked.
- Can I control how my content gets chunked?
- No. A publisher does not choose the chunk size, the split boundaries, the overlap, or whether the retrieval system prepends context to each chunk (contextual chunking). Those are decisions made by whoever operates the retrieval pipeline (the AI engine or RAG application). What a publisher controls is whether each passage still makes sense after it is cut out of the surrounding page. The durable response to chunking is not guessing a chunk size you cannot set; it is writing self-contained passages that survive whatever boundary the system chooses.
- Does chunking affect whether AI engines cite my content?
- Indirectly. If a passage becomes uninterpretable once isolated from its page (for example, a sentence whose subject is only named in a heading three paragraphs up), it is harder to retrieve for a relevant query and harder for the model to use if retrieved. Anthropic's contextual-retrieval work showed that preserving context around chunks reduced retrieval failure substantially on its own corpora. A publisher cannot add that context the way a retrieval operator can, but can approximate the benefit by stating each passage's subject within the passage itself.
Sources & further reading
- Anthropic: Introducing Contextual Retrieval (September 19, 2024; standard chunking, context-loss failure mode, contextual chunking results) · 2024-09-19
- Lewis et al.: Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks (arXiv:2005.11401, May 2020; retrieve-then-generate, the pipeline chunking feeds) · 2020-05-22
- Günther et al. (Jina AI): Late Chunking: Contextual Chunk Embeddings Using Long-Context Embedding Models (arXiv:2409.04701, September 2024; embed-whole-document-then-split, a parallel approach to contextual chunking) · 2024-09-07