
Context rot

Context rot is the empirically observed degradation in an LLM's output quality as its input context grows longer, even on simple tasks and well below the model's maximum context window. Formalized by Chroma's 2025 study across 18 models, it is distinct from context-window overflow (hitting the hard token limit) and broader than lost in the middle (which is specifically positional): context rot is degradation along the length axis. For publishers, it reinforces why retrieval pipelines aim to keep only a bounded, high-signal context, and why concise, self-contained passages survive better than verbose padding.

Citation status

ChatGPT · Perplexity · Claude · Copilot · Gemini

Last checked 2026-06-04

Context rot is the empirically observed tendency of large language models to produce less reliable output as their input context grows longer, even on simple tasks and well before the model's maximum context window is reached. The term was formalized by Chroma's 2025 study, which tested 18 frontier models across the Anthropic, OpenAI, Google, and Alibaba families and found that models "do not use their context uniformly; instead, their performance grows increasingly unreliable as input length grows" [1].

The key distinction is from context-window overflow. Overflow is hitting the hard token limit, after which content is truncated. Context rot happens well before that limit: a model with a large context window can already degrade at a fraction of it. The cause is not running out of room; it is that more tokens make the model's use of any one of them less reliable.

Status in 2026

Chroma's contribution was methodological: it isolated input length as the variable while holding task complexity constant, so the degradation it measures is attributable to length itself rather than to harder questions hiding in longer inputs. The study also found that the effect is non-uniform: distractor passages (plausible but irrelevant content) amplify degradation as input grows, and lower semantic similarity between the query and the target information increases the rate of decline [1].
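The length-isolation idea can be sketched in a few lines. This is a minimal illustration, not Chroma's actual harness: the same needle and question are embedded in filler of increasing size, so only input length (and, optionally, the presence of distractors) changes between conditions. The function name `build_haystack_prompt`, its parameters, the sample sentences, and the use of word count as a stand-in for token count are all assumptions made for this example.

```python
import random


def build_haystack_prompt(needle, question, filler_sentences, target_words,
                          distractors=(), needle_position=0.5, seed=0):
    """Embed a fixed needle in roughly `target_words` words of filler.

    Word count is used as a crude proxy for token count (an assumption).
    """
    rng = random.Random(seed)
    body, word_count = [], 0
    # Grow the filler until the target size is reached.
    while word_count < target_words:
        sentence = rng.choice(filler_sentences)
        body.append(sentence)
        word_count += len(sentence.split())
    # Scatter distractors (plausible but irrelevant) at random positions.
    for distractor in distractors:
        body.insert(rng.randrange(len(body) + 1), distractor)
    # Place the needle at a relative position: 0.0 = start, 1.0 = end.
    body.insert(int(needle_position * len(body)), needle)
    context = " ".join(body)
    return f"{context}\n\nQuestion: {question}\nAnswer:"


# Illustrative inputs (hypothetical, not taken from the Chroma study).
FILLER = [
    "The harbor was quiet that morning.",
    "Several reports were filed late in the quarter.",
    "Maintenance crews inspected the east stairwell.",
]
NEEDLE = "The access code for the archive room is 7421."
QUESTION = "What is the access code for the archive room?"
DISTRACTORS = ["The access code for the supply closet is 3318."]

# Same task at three input sizes: only the amount of surrounding text differs.
prompts = {
    n: build_haystack_prompt(NEEDLE, QUESTION, FILLER, n, DISTRACTORS)
    for n in (100, 1000, 10000)
}
```

Because the needle, the question, and the distractor are identical in every prompt, any change in answer quality across the three sizes can be attributed to length rather than to a harder question.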

Context rot is closely related to, but broader than, lost in the middle. Lost in the middle is the positional finding (Liu et al. 2023): information in the middle of a long context is used less reliably than information at the edges, a U-shaped curve [2]. Context rot is the length-axis finding: total input length degrades reliability regardless of where the relevant information sits. The two are cousins describing the same underlying fact, that a model does not use its assembled context uniformly, from different angles. Chroma's study does not frame its results in terms of the lost-in-the-middle paper; positional effects appear in some of its experiments, but its headline axis is length, not position.
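The difference between the two axes shows up in how the same grid of results is averaged. The sketch below assumes accuracy has been recorded for each (input length, needle position) pair; `marginal_accuracy` is a hypothetical helper, and the numbers in `toy_results` are placeholders chosen to make the arithmetic easy to follow, not measurements from either study.

```python
from collections import defaultdict
from statistics import mean


def marginal_accuracy(results, axis):
    """Average accuracy along one axis of a (length, position) grid.

    results: dict mapping (input_length, needle_position) -> accuracy in [0, 1]
    axis: 0 groups by input length (the context-rot view),
          1 groups by needle position (the lost-in-the-middle view)
    """
    groups = defaultdict(list)
    for key, accuracy in results.items():
        groups[key[axis]].append(accuracy)
    return {k: mean(v) for k, v in sorted(groups.items())}


# Placeholder values for illustration only; NOT measured data.
toy_results = {
    (1000, "start"): 1.0, (1000, "middle"): 1.0, (1000, "end"): 1.0,
    (8000, "start"): 1.0, (8000, "middle"): 0.0, (8000, "end"): 1.0,
}

by_length = marginal_accuracy(toy_results, axis=0)    # one value per input length
by_position = marginal_accuracy(toy_results, axis=1)  # one value per position
```

Grouping by length answers the context-rot question (does reliability fall as the input grows?), while grouping by position answers the lost-in-the-middle question (does reliability depend on where the target sits?).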

Honest scope: context rot is primarily a finding for people building LLM applications (curate the context you assemble; do not dump everything in and assume the model will weight it perfectly). Its magnitude is model-dependent and the 2025 measurements are a snapshot, not a constant.

How to apply

Like lost in the middle, context rot is mostly a property a publisher does not control: you do not decide how much total context a retrieval pipeline assembles around your content. What it changes is the realistic picture of what your content competes inside.

  • Write concise, high-signal passages. Verbose content that pads a retrieved context contributes to the very length that degrades reliability. A passage that makes its point in fewer tokens is easier for the model to use whether the assembled context is short or long. This compounds with passage-level optimization.
  • Do not assume a large context window means your content will be fully used. A model advertised with a huge window still exhibits context rot; being retrieved into a long context is not the same as being used reliably within it.
  • For your own LLM or RAG builds, curate rather than dump. If you operate a retrieval system, context rot is a direct argument for retrieving fewer, higher-relevance chunks rather than stuffing the window. This is the application-side mirror of the publisher-side advice.
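For the "curate rather than dump" point, a minimal sketch of budgeted selection follows. It assumes each retrieved chunk already carries a relevance score from your retriever; `curate_context`, the `min_score` threshold, and the word-count token proxy are illustrative choices, not a prescribed method.

```python
def curate_context(chunks, max_tokens, min_score=0.0, count_tokens=None):
    """Keep the highest-scoring chunks that fit a token budget.

    chunks: list of (text, relevance_score) pairs
    max_tokens: budget for the assembled context
    min_score: chunks scoring below this are dropped even if room remains
    count_tokens: optional callable; defaults to a crude word count (assumption)
    """
    if count_tokens is None:
        count_tokens = lambda text: len(text.split())
    selected, used = [], 0
    for text, score in sorted(chunks, key=lambda c: c[1], reverse=True):
        if score < min_score:
            break  # everything after this point scores lower still
        cost = count_tokens(text)
        if used + cost > max_tokens:
            continue  # too large for the remaining budget; try smaller chunks
        selected.append(text)
        used += cost
    return selected


# Hypothetical retrieved chunks with made-up scores.
retrieved = [
    ("a b c", 0.9),
    ("d e f g h", 0.8),
    ("i j", 0.2),
    ("k", 0.75),
]
context_chunks = curate_context(retrieved, max_tokens=5, min_score=0.5)
```

The relevance floor matters as much as the budget: without it, low-scoring chunks would fill whatever room is left and add length without adding signal.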

What to skip:

  • Padding content to look comprehensive. Length that does not add signal works against the model that retrieves it.
  • Treating a model's advertised context length as a reliability guarantee. The window is a capacity limit, not a promise of uniform use across it.
  • Treating the 2025 magnitudes as fixed. The effect is model-dependent and will shift as models change; verify against current behavior.
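Verifying against current behavior can be as simple as re-running a fixed task at several input lengths on the model you actually use. The sketch below assumes a hypothetical `ask_model` callable that wraps your own client and returns the model's answer as text; the case-insensitive substring match is a deliberately crude scoring rule chosen for brevity.

```python
def accuracy_by_length(ask_model, cases_by_length):
    """Score a model on the same task at several input lengths.

    ask_model: callable taking a prompt string and returning answer text
               (hypothetical; wrap whatever client you use)
    cases_by_length: dict mapping input length -> list of (prompt, expected)
    Returns a dict mapping input length -> fraction of correct answers.
    """
    report = {}
    for length, cases in sorted(cases_by_length.items()):
        if not cases:
            continue  # nothing to score at this length
        correct = sum(
            expected.lower() in ask_model(prompt).lower()
            for prompt, expected in cases
        )
        report[length] = correct / len(cases)
    return report
```

A report that stays flat as length grows suggests the effect is small for that model and task; a falling curve is the signature to watch for.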

How it relates to other concepts

  • The length-axis cousin of lost in the middle: lost in the middle is positional (where in the context), context rot is about length (how much context). Both are instances of a model not using its assembled context uniformly, and both set a realistic ceiling on what content optimization can guarantee.
  • A reason chunking and curated retrieval matter: context rot is a direct argument for retrieving fewer, denser, higher-relevance chunks rather than assembling a large context, because length itself degrades reliability.
  • Reinforces passage-level optimization: concise self-contained passages contribute less to context-length degradation and stay usable whether the assembled context is short or long.
  • A ceiling sibling in the ai-behavior cluster with citation precision and hallucination grounding: like those, context rot describes a measured model behavior that bounds what publisher-side optimization can achieve, rather than a tactic.

Footnotes

  1. Kelly Hong, Anton Troynikov, and Jeff Huber. "Context Rot: How Increasing Input Tokens Impacts LLM Performance." Chroma technical report, July 14, 2025. trychroma.com/research/context-rot. Evaluated 18 LLMs across the Anthropic, OpenAI, Google, and Alibaba families (including GPT-4.1, Claude 4, Gemini 2.5, and Qwen3). Core finding: models "do not use their context uniformly; instead, their performance grows increasingly unreliable as input length grows," observed even on simple tasks and well below maximum context windows. Methodologically isolates input length while holding task complexity constant, so the measured degradation is attributable to length rather than to harder questions in longer inputs. Distractors (plausible but irrelevant content) amplify degradation as input grows, and lower query-target semantic similarity increases the rate of decline. Distinct from context-window overflow (hitting the hard token limit).

  2. Liu, N.F., Lin, K., Hewitt, J., Paranjape, A., Bevilacqua, M., Petroni, F., and Liang, P. "Lost in the Middle: How Language Models Use Long Contexts." arXiv:2307.03172 (2023); Transactions of the ACL, Vol. 12, 2024. The positional finding: accuracy is highest when relevant information sits at the start or end of a long context and degrades in the middle, a U-shaped curve. See the lost in the middle entry for detail. Cited here as the positional cousin of context rot's length axis; this glossary frames the two as related findings, not the same result.

FAQ

What is context rot in LLMs?
Context rot is the tendency of large language models to produce less reliable output as their input context grows longer, even on simple tasks and well before the maximum context window is reached. It was formalized by Chroma's 2025 study, which tested 18 frontier models and found they do not use their context uniformly: performance grows increasingly unreliable as input length grows. The cause is not running out of room; it is that more tokens make the model's use of any one of them less reliable.
Is context rot the same as lost in the middle?
Related but not the same. Lost in the middle is the positional finding (Liu et al. 2023): information in the middle of a long context is used less reliably than information at the edges, a U-shaped curve. Context rot is the length-axis finding (Chroma 2025): total input length degrades reliability regardless of where the relevant information sits. They are cousins describing the same underlying fact, that a model does not use its assembled context uniformly, from different angles. Chroma's study does not frame its results in terms of the lost-in-the-middle paper; this glossary presents them as related findings, not one result.
Does a bigger context window fix context rot?
No. A large context window is a capacity limit, not a guarantee of uniform use. Chroma's study found degradation well below the maximum window across all 18 models tested, including models advertised with very large windows. Being retrieved into a long context is not the same as being used reliably within it. The effect is model-dependent and the 2025 measurements are a snapshot, so verify against current behavior rather than assuming a fixed magnitude.
