GEO Glossary


Citation precision and recall

Citation precision is the fraction of citations in an AI engine's response that actually support the sentence they are attached to. Citation recall is the fraction of generated sentences that are fully supported by their citations. Both are model-behavior metrics, not publisher-visibility metrics: they measure how faithfully an AI engine uses the sources it cites, not how often a publisher's content appears as a source.

Last checked 2026-05-30

Citation precision and citation recall are paired but independent dimensions of the broader question "does this engine cite things correctly?"

The benchmark measurement is Liu, Zhang, and Liang's "Evaluating Verifiability in Generative Search Engines" (Findings of EMNLP 2023; arXiv:2304.09848) [1], a human-evaluation audit of four commercial generative search engines (Bing Chat, NeevaAI, perplexity.ai, YouChat) that reported an average citation precision of 74.5% and an average citation recall of 51.5%. The asymmetry is the load-bearing finding: in the 2023 audit, commercial AI search engines achieved higher precision than recall. When they did attach a citation, it usually supported the claim, but nearly half of generated sentences were not fully supported by their citations.

For publishers, the metric distinction matters because being cited does not imply being represented correctly. A 74.5% precision figure means roughly one in four citations in commercial AI search responses did not fully support the attached claim, ranging from partial or weak support to outright misrepresentation. A 51.5% recall figure means nearly half of generated sentences in those responses were not fully supported by their citations, either because no citation was attached or because the attached citations covered only part of the claim. Both failure modes are invisible at the attribution rate or citation match rate layers; only direct claim-alignment evaluation surfaces them.

What are citation precision and recall?

Liu et al. define citation precision as the fraction of citations a generative search engine outputs that actually support the sentence they are attached to. A citation passes the precision check if a careful reader, presented with the engine's claimed citation and the cited source, would judge that the source supports the adjacent sentence. Citation recall is the fraction of generated sentences (in answers to factual queries) that are fully supported by at least one citation, where "fully supported" requires that the citation set together backs the sentence's claim without requiring the reader to rely on other unstated sources.
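The two definitions above can be sketched in a few lines of Python. This is a simplified illustration, not Liu et al.'s evaluation code: the `Citation` and `Sentence` classes, their field names, and the example URLs are hypothetical, and each human judgment is reduced to a single boolean.

```python
from dataclasses import dataclass, field


@dataclass
class Citation:
    url: str
    supports_sentence: bool  # human judgment: does the source support the sentence?


@dataclass
class Sentence:
    text: str
    citations: list[Citation] = field(default_factory=list)
    fully_supported: bool = False  # human judgment on the citation set as a whole


def citation_precision(sentences: list[Sentence]) -> float:
    """Fraction of all citations that support the sentence they are attached to."""
    citations = [c for s in sentences for c in s.citations]
    if not citations:
        return float("nan")  # undefined when the response contains no citations
    return sum(c.supports_sentence for c in citations) / len(citations)


def citation_recall(sentences: list[Sentence]) -> float:
    """Fraction of generated sentences that are fully supported by their citations."""
    if not sentences:
        return float("nan")
    return sum(s.fully_supported for s in sentences) / len(sentences)


# Hypothetical annotated response: four sentences, five citations.
example = [
    Sentence("Claim A.", [Citation("https://example.com/a", True),
                          Citation("https://example.com/b", False)], fully_supported=True),
    Sentence("Claim B.", [Citation("https://example.com/c", True),
                          Citation("https://example.com/d", True)], fully_supported=True),
    Sentence("Claim C.", [], fully_supported=False),
    Sentence("Claim D.", [Citation("https://example.com/e", False)], fully_supported=False),
]

print(citation_precision(example))  # 3 of 5 citations support their sentence
print(citation_recall(example))     # 2 of 4 sentences are fully supported
```

Note that the denominators differ: precision is counted over citations, recall over sentences, which is why the two numbers can move independently.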

The two metrics decompose citation behavior in a 2x2 frame:

| | Citation present | No citation |
|---|---|---|
| Supports the claim | High precision, contributes to recall | No citation to evaluate; sentence is unattributed (this is the recall-failure quadrant: a citable claim left without citation support) |
| Does not support the claim | Low precision; the citation does not fully support the claim, ranging from partial support to outright misrepresentation | No citation to evaluate; sentence is unattributed (recall-failure case) |

The 2x2 surfaces why aggregating precision and recall into a single "citation accuracy" number hides the dominant failure mode. A high-precision, low-recall engine produces accurate citations sparingly. A low-precision, high-recall engine produces many citations, of which a substantial fraction are wrong. The publisher-side risk differs by failure mode: low-recall engines may simply not cite a publisher at all (a cite-ability gap), while low-precision engines may cite the publisher and attribute claims the publisher's content does not actually make.

Status in 2026

The Liu, Zhang, and Liang baseline remains the most-cited measurement of citation faithfulness, but it is also dated and partially obsolete:

  • NeevaAI announced shutdown on May 20, 2023 and stopped consumer search on June 2, 2023, before the paper's December 2023 publication; the NeevaAI numbers are historical only.
  • Bing Chat was renamed to Microsoft Copilot at Microsoft Ignite on November 15, 2023; the product persists but the model backend, retrieval pipeline, and citation rendering have iterated multiple times since the audit.
  • perplexity.ai and YouChat persist but have iterated through multiple model backends. Perplexity's Sonar 4-variant family and YouChat's various model selections produce different citation behavior than the 2023 versions of these engines that Liu et al. audited.

The adjacent follow-up audit is Li and Sinnamon (2024) [2], which audited ChatGPT, Bing Chat, and Perplexity at a smaller sample size (48 queries across 4 topics over 7 days) and documented sentiment bias and commercial / geographic source-authority biases rather than replicating the precision and recall framework. A separate Profound longitudinal citation study [3] covered ~680 million citations across ChatGPT, Google AI Overviews, and Perplexity from August 2024 through October 2025 and reported per-engine source-concentration patterns: Wikipedia accounted for ~7.8% of all ChatGPT citations and ~47.9% within ChatGPT's top 10 sources; Reddit accounted for ~6.6% of Perplexity citations and ~2.2% of Google AI Overview citations; only ~11% of cited domains were shared between ChatGPT and Perplexity. Neither study measured precision or recall in Liu et al.'s sense.

Treat the 74.5% / 51.5% baseline as the historical anchor that any current measurement should be compared against, not as a 2026 statement of engine behavior. The metric framework remains the right one; the per-engine numbers should be re-measured.

The two metrics in detail

Citation precision

Per-citation question: does the cited URL actually support the sentence the citation is attached to? Human evaluation typically uses a three-level judgment scale (yes / partial / no) on the question "would a careful reader, given the citation and the source, judge that the source supports the adjacent sentence's claim?" A "partial" judgment captures cases where the source supports part of the claim but not the entire claim (e.g., the source confirms a statistic but not the causal interpretation around it).
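How a "partial" judgment is counted changes the headline precision figure, so it is worth stating the rule explicitly. The sketch below is one possible convention, not the scoring rule of any particular study: the function name `precision_from_judgments` and the `partial_weight` parameter are hypothetical.

```python
from collections import Counter

VALID_JUDGMENTS = {"yes", "partial", "no"}


def precision_from_judgments(judgments: list[str], partial_weight: float = 0.0) -> float:
    """Precision over three-level judgments.

    partial_weight=0.0 is strict scoring (only "yes" counts as support);
    partial_weight=1.0 is lenient scoring ("partial" counts as full support).
    """
    counts = Counter(judgments)
    unknown = set(counts) - VALID_JUDGMENTS
    if unknown:
        raise ValueError(f"unexpected judgment labels: {sorted(unknown)}")
    total = sum(counts.values())
    if total == 0:
        return float("nan")  # no citations were scored
    return (counts["yes"] + partial_weight * counts["partial"]) / total


# Hypothetical sample: 6 "yes", 2 "partial", 2 "no".
sample = ["yes"] * 6 + ["partial"] * 2 + ["no"] * 2

print(precision_from_judgments(sample))                      # strict
print(precision_from_judgments(sample, partial_weight=0.5))  # half credit
print(precision_from_judgments(sample, partial_weight=1.0))  # lenient
```

On this sample the strict, half-credit, and lenient conventions give 6/10, 7/10, and 8/10, so a reported precision is only comparable across rounds if the same convention is used throughout.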

Failure modes Liu et al. observed:

  • Off-topic citation: the cited source is on a related topic but does not address the specific claim.
  • Out-of-context quotation: the cited source contains the cited text but in a context that means something different.
  • Misattributed inference: the cited source supports a weaker claim than the one the engine made, and the citation papers over the gap.
  • Hallucinated citation: the cited URL does not contain the cited content at all (the most severe failure).

Per-engine precision rates vary; the 74.5% average masks the per-engine spread (one engine in the audit scored noticeably worse than the average).

Citation recall

Per-sentence question: is this sentence fully supported by at least one citation? "Fully supported" is strict: a sentence whose claim requires inference beyond the cited source's content does not count as fully supported, even if the citation is partially relevant. Recall failure modes:

  • Uncited factual claim: the sentence makes a factual claim with no citation attached.
  • Partially cited claim: a citation is attached but supports only part of the claim.
  • Citation-padded interpretation: the sentence offers interpretation or commentary that the citation cannot support even though the citation is on the same topic.

Recall is generally harder for engines to optimize: citing more sentences raises recall, but it can lower precision if the additional citations are attached less carefully. The 51.5% recall figure suggests that, as of 2023, engines erred on the side of citing less than they should, possibly to avoid lowering precision further.
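A recall sample becomes more useful when each failing sentence is tagged with one of the failure modes listed above, since the breakdown shows where the missing support comes from. The following is a minimal sketch under that assumption; the label strings and the `recall_breakdown` function are hypothetical names chosen for illustration.

```python
from collections import Counter

SUPPORTED = "fully_supported"
FAILURE_LABELS = ("uncited", "partially_cited", "citation_padded")


def recall_breakdown(labels: list[str]) -> tuple[float, dict[str, float]]:
    """Return (recall, share of sentences per failure mode) for one-label-per-sentence data."""
    allowed = {SUPPORTED, *FAILURE_LABELS}
    unknown = set(labels) - allowed
    if unknown:
        raise ValueError(f"unexpected labels: {sorted(unknown)}")
    if not labels:
        raise ValueError("no sentences were labeled")
    counts = Counter(labels)
    total = len(labels)
    recall = counts[SUPPORTED] / total
    failures = {label: counts[label] / total for label in FAILURE_LABELS}
    return recall, failures


# Hypothetical labels for a ten-sentence response.
labels = (
    ["fully_supported"] * 5
    + ["uncited"] * 3
    + ["partially_cited"] * 1
    + ["citation_padded"] * 1
)

recall, failures = recall_breakdown(labels)
print(recall)    # 5 of 10 sentences are fully supported
print(failures)  # share of all sentences falling into each failure mode
```

Because every sentence receives exactly one label, recall and the three failure shares sum to 1.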

How to apply

Add citation precision tracking to your citation probe protocol by extending the recording schema. For each citation observed in a weekly probe round:

  • Record the verbatim sentence the citation is attached to.
  • Record the URL the citation points to.
  • Score a three-level judgment (yes / partial / no) on whether the cited source supports the adjacent sentence's claim.

After several probe rounds (eight weekly rounds in this glossary's manual protocol; the right number depends on probe-set size and engine count) you have a per-engine citation precision sample against your fixed prompt set. For publishers tracking their own representation specifically, the practical compromise is to score only citations of your own content (per-publisher precision rather than full-pool precision); this is much faster to evaluate and surfaces the publisher-relevant question "is the engine representing my content correctly when it does cite me." Engine-wide precision is a different question that benchmark-style audits are better suited to answer.
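The recording steps and the per-publisher compromise described above could be captured in a small schema like the one below. This is a sketch, not the glossary's actual probe tooling: `CitationRecord`, its field names, the engine names, and the `example.com` domain are all placeholders, and precision is scored strictly (only "yes" counts as support).

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional
from urllib.parse import urlparse


@dataclass(frozen=True)
class CitationRecord:
    probe_date: str  # ISO date of the probe round
    engine: str
    prompt_id: str
    sentence: str    # verbatim sentence the citation is attached to
    cited_url: str   # URL the citation points to
    judgment: str    # "yes" / "partial" / "no"


def is_own_content(url: str, own_domain: str) -> bool:
    """True if the URL's host is own_domain or one of its subdomains."""
    host = (urlparse(url).hostname or "").lower()
    return host == own_domain or host.endswith("." + own_domain)


def per_engine_precision(records: list[CitationRecord],
                         own_domain: Optional[str] = None) -> dict[str, float]:
    """Strict precision per engine; restrict to own_domain for per-publisher precision."""
    totals: dict[str, int] = defaultdict(int)
    hits: dict[str, int] = defaultdict(int)
    for r in records:
        if own_domain is not None and not is_own_content(r.cited_url, own_domain):
            continue
        totals[r.engine] += 1
        hits[r.engine] += r.judgment == "yes"
    return {engine: hits[engine] / totals[engine] for engine in totals}


# Hypothetical records from one probe round.
records = [
    CitationRecord("2026-01-05", "perplexity", "p1", "Sentence one.", "https://example.com/a", "yes"),
    CitationRecord("2026-01-05", "perplexity", "p1", "Sentence two.", "https://example.com/b", "partial"),
    CitationRecord("2026-01-05", "perplexity", "p2", "Sentence three.", "https://other.org/x", "yes"),
    CitationRecord("2026-01-05", "copilot", "p1", "Sentence four.", "https://www.example.com/a", "yes"),
]

print(per_engine_precision(records))                            # full-pool precision
print(per_engine_precision(records, own_domain="example.com"))  # per-publisher precision
```

Keeping the verbatim sentence and URL in each record means a second evaluator can re-score the same sample later without re-running the probe.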

What to skip: trying to compute citation recall at scale on every probe round. Recall requires evaluating every sentence in every response, not only the cited ones; it is much more time-expensive than precision. Sample recall quarterly on a smaller subset of responses rather than weekly on every response.
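One way to keep the quarterly recall sample small and repeatable is to draw it with a fixed random seed, so the same subset can be re-scored by a second evaluator. The sketch below assumes each stored response has a string identifier; the function name `sample_for_recall`, the identifier format, and the sample size are illustrative choices, not part of the protocol.

```python
import random


def sample_for_recall(response_ids: list[str], sample_size: int, seed: int) -> list[str]:
    """Draw a reproducible random subset of responses for sentence-level recall scoring."""
    rng = random.Random(seed)             # local generator; does not touch global random state
    unique_ids = sorted(set(response_ids))  # sort so the draw does not depend on input order
    k = min(sample_size, len(unique_ids))
    return sorted(rng.sample(unique_ids, k))


# Hypothetical quarter: 13 weekly rounds x 20 prompts = 260 stored responses.
quarter_ids = [f"week{w:02d}-prompt{p:02d}" for w in range(1, 14) for p in range(1, 21)]

subset = sample_for_recall(quarter_ids, sample_size=30, seed=2026)
print(len(subset))  # number of responses to score sentence by sentence
```

Scoring 30 responses instead of 260 keeps the per-sentence evaluation tractable, at the cost of a wider margin of error on the resulting recall estimate.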

What remains contested or unverified

  • Whether 2026 engines have higher precision than the 2023 baseline. Newer model backends and improved retrieval-augmented generation pipelines may have improved precision, but no public benchmark has replicated Liu et al.'s methodology at comparable sample size. The plausible-but-unverified hypothesis is that precision has improved on engines with stronger grounding pipelines (Perplexity, Google AI Overview) and remained mixed on engines with looser grounding (ChatGPT non-search responses, Gemini ungrounded mode).
  • Whether recall has improved. The Profound 2024-2025 longitudinal data documents per-engine source-concentration patterns at scale but does not measure recall in Liu et al.'s sentence-level sense. Whether 2026 engines now cite a higher fraction of claim-bearing sentences than the 2023 51.5% baseline is not measured by the available follow-up studies; more frequent citation surfacing does not necessarily mean better per-sentence support coverage.
  • Per-engine precision-recall trade-offs. Engines may have moved in different directions: some optimizing precision at the cost of recall (cite less but cite well), others optimizing recall at the cost of precision (cite more, some incorrectly). The per-engine current state would require a 2026-equivalent of Liu et al.'s audit.
  • Subjectivity of human evaluation. The 74.5% / 51.5% figures are human-judged. Different evaluators applying the same criteria to the same responses can produce different scores at the margins; Liu et al. report inter-annotator agreement that is reasonable but not perfect.
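For a self-run probe, the subjectivity concern in the last item above can be checked by having two evaluators score the same citations and computing an agreement statistic. The sketch below uses Cohen's kappa, a standard chance-corrected agreement measure for two annotators; choosing it here is an assumption for illustration and is not a claim about which statistic Liu et al. report.

```python
from collections import Counter


def cohens_kappa(annotator_a: list[str], annotator_b: list[str]) -> float:
    """Cohen's kappa for two annotators labeling the same items."""
    if len(annotator_a) != len(annotator_b) or not annotator_a:
        raise ValueError("both annotators must label the same non-empty set of items")
    n = len(annotator_a)
    # Observed agreement: share of items where both annotators chose the same label.
    observed = sum(a == b for a, b in zip(annotator_a, annotator_b)) / n
    # Expected agreement by chance, from each annotator's label frequencies.
    freq_a, freq_b = Counter(annotator_a), Counter(annotator_b)
    expected = sum(freq_a[label] * freq_b[label] for label in set(freq_a) | set(freq_b)) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical judgments from two evaluators on the same six citations.
evaluator_1 = ["yes", "yes", "no", "partial", "yes", "no"]
evaluator_2 = ["yes", "partial", "no", "partial", "yes", "yes"]

print(cohens_kappa(evaluator_1, evaluator_2))  # equals 11/23 for this sample
```

In this sample the evaluators agree on 4 of 6 items, but chance agreement is 13/36, so kappa is (24/36 - 13/36) / (23/36) = 11/23, noticeably lower than the raw agreement rate.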

How it relates to other concepts

  • Sequential failure-mode pair with hallucination grounding: hallucination grounding asks whether the answer is anchored in retrieved or provided source content at all; citation precision asks whether the displayed citation actually supports the attached claim. An ungrounded answer fails the hallucination check; a grounded but imprecisely cited answer passes hallucination but fails precision. Both failures harm publishers, but the publisher-visible signature differs.
  • Distinct from citation match rate: citation match rate measures display (linked vs unlinked); citation precision measures reasoning (does the cited URL support the claim). A response can have 100% match rate and 50% precision.
  • Distinct from attribution rate: attribution rate measures whether a source is cited at all (a publisher-visibility metric); citation precision measures whether the citation is faithful to the source (a model-behavior metric).
  • Tracked via the citation probe protocol: the protocol's recording schema can be extended with a per-citation claim-alignment field to sample precision over weekly probe rounds.
  • Adjacent to the Citation vs Mention vs Link taxonomy: the 2x2 taxonomy classifies citations on display dimensions (linked / unlinked, citation / mention); citation precision is an orthogonal quality dimension that applies within each cell of the 2x2.
  • Counterpoint to sycophancy vs cite-able fact: sycophancy is the engine over-agreeing with the user without grounding; citation precision is the engine grounding but doing so inaccurately. Different failure modes of the same underlying trust requirement.
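The distinction between citation match rate and citation precision in the list above can be made concrete with a toy response in which every citation is a clickable link but only half support their sentence. The `ObservedCitation` class and its fields are hypothetical names used only for this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ObservedCitation:
    linked: bool    # display property: rendered as a clickable URL?
    supports: bool  # faithfulness property: does the source support the sentence?


def citation_match_rate(citations: list[ObservedCitation]) -> float:
    """Share of citations rendered as linked URLs."""
    return sum(c.linked for c in citations) / len(citations)


def citation_precision(citations: list[ObservedCitation]) -> float:
    """Share of citations whose source supports the attached sentence."""
    return sum(c.supports for c in citations) / len(citations)


# Hypothetical response: four linked citations, two of which support their sentence.
response = [
    ObservedCitation(linked=True, supports=True),
    ObservedCitation(linked=True, supports=False),
    ObservedCitation(linked=True, supports=True),
    ObservedCitation(linked=True, supports=False),
]

print(citation_match_rate(response))  # every citation is linked
print(citation_precision(response))   # only half of them support the claim
```

The two functions read different fields of the same record, which is the sense in which the metrics are orthogonal: improving one says nothing about the other.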

Footnotes

  1. Nelson F. Liu, Tianyi Zhang, Percy Liang. "Evaluating Verifiability in Generative Search Engines." Findings of the Association for Computational Linguistics: EMNLP 2023, pp. 7001-7025. arXiv:2304.09848; ACL Anthology. Human-evaluation audit of four commercial generative search engines (Bing Chat, NeevaAI, perplexity.ai, YouChat). Reports average citation precision of 74.5% and average citation recall of 51.5%. Two of the four engines audited have changed substantially since: NeevaAI announced shutdown May 20, 2023 and stopped consumer search June 2, 2023; Bing Chat was renamed to Microsoft Copilot at Microsoft Ignite on November 15, 2023. The 74.5% / 51.5% averages should be read as the 2023 baseline; engine-specific 2026 numbers would require a follow-up audit.

  2. Alice Li and Luanne Sinnamon. "Generative AI Search Engines as Arbiters of Public Knowledge: An Audit of Bias and Authority." Proceedings of the Association for Information Science and Technology, 61: 205-217 (2024). arXiv:2405.14034; Wiley Online Library. Submitted May 22, 2024. Audits ChatGPT, Bing Chat, and Perplexity using 48 queries across 4 topics over a 7-day window, with sentiment analysis and source classification. Findings: sentiment bias varying by query and topic, and commercial / geographic bias in sources (heavy reliance on News and Media, Business, and Digital Media websites). Does not replicate Liu et al.'s precision and recall measurement; an adjacent audit on different dimensions of the same trust requirement.

  3. Profound. "AI Platform Citation Patterns: How ChatGPT, Google AI Overviews, and Perplexity Source Information." Longitudinal study covering approximately 680 million citations from August 2024 through October 2025. tryprofound.com/blog/ai-platform-citation-patterns. Reports per-engine source-concentration patterns (Wikipedia ~7.8% of ChatGPT citations / ~47.9% of ChatGPT's top 10; Reddit ~6.6% of Perplexity citations and ~2.2% of Google AI Overview citations; ~11% domain overlap between ChatGPT and Perplexity). Provides per-engine source-distribution data at scale but does not measure citation precision or recall in Liu et al.'s sentence-level sense.


FAQ

What's the difference between citation precision and citation recall?
Citation precision answers 'when the engine cites a source, does the source actually support the claim?' Citation recall answers 'when the engine makes a claim, is that claim fully backed by its citations?' The two are paired but independent. An engine can have high precision and low recall (every citation it produces is accurate, but most of its sentences lack citations). It can have high recall and low precision (most sentences are citation-backed, but many of those citations do not actually support the adjacent claim). Liu, Zhang, and Liang's 2023 audit of four commercial engines (Bing Chat, NeevaAI, perplexity.ai, YouChat) found average citation precision of 74.5% and average citation recall of 51.5%; the asymmetry suggests that as of 2023, commercial engines were more accurate when they did cite than they were thorough about citing every claim.
Are these 2023 figures still current?
Probably not at the per-engine level. Of the four engines audited by Liu, Zhang, and Liang, two have changed substantially since 2023: NeevaAI announced shutdown on May 20, 2023 and stopped consumer search on June 2, 2023; Bing Chat was renamed to Microsoft Copilot at Microsoft Ignite on November 15, 2023. YouChat and perplexity.ai persist but have iterated through multiple model backends. The 74.5% / 51.5% averages should be read as the 2023 baseline measurement that any 2026 follow-up should compare against rather than as current numbers. The adjacent follow-up is Li and Sinnamon (2024), which audited ChatGPT, Bing Chat, and Perplexity on a smaller scale and documented sentiment and source-authority biases on different dimensions than precision and recall. No publicly available follow-up has replicated Liu et al.'s exact precision and recall metrics at the original sample size.
How is citation precision different from citation match rate?
Citation match rate measures whether a citation is a linked URL versus an unlinked text mention (a rendering property of the engine's response). Citation precision measures whether the cited URL actually supports the claim the citation is attached to (a faithfulness property of the model's reasoning). A response can have a 100% citation match rate (every citation is a clickable link) and a 50% citation precision (half the linked sources do not actually support the claim). The two metrics belong to different parts of the citation-quality stack: citation match rate is about display, citation precision is about reasoning.
How is citation precision different from hallucination grounding?
Hallucination grounding asks whether the engine's answer is anchored in retrieved or provided source content at all, or whether it is generated from training-data recall without retrieval. Citation precision asks the next question: given that the engine did retrieve and cite a source, does the cited source actually support the specific claim the citation is attached to. The two are sequential: an answer can be grounded in retrieval (passing the hallucination check) but still cite the retrieved sources inaccurately (failing the precision check). Both failure modes harm publishers, but in different ways: ungrounded answers may not cite the publisher at all, while imprecisely cited answers may misrepresent the publisher's content while still attributing the misrepresentation to the publisher's URL.
How do I add citation precision tracking to my own measurement program?
Augment your [citation probe protocol](/terms/citation-probe-protocol) recording schema with a per-citation claim-alignment field. For each citation observed in a probe round, record (1) the verbatim sentence the citation is attached to in the AI answer, (2) the URL the citation points to, and (3) a yes / partial / no judgment on whether the cited source actually supports the claim in the adjacent sentence. After several probe rounds (for example, eight weekly rounds in this glossary's manual protocol) you have a per-engine citation precision rate sampled against your fixed prompt set. The judgment step is subjective and time-expensive; the practical compromise is to score only the citations of your own content (per-publisher precision), not the entire citation pool, which still surfaces whether the engine is representing your specific content accurately.
