Research · Dispatch #5

GEO's most-cited numbers, checked against the papers they come from

The GEO field rests on a few foundational studies, but the headline figures travel into the marketing with the methodology stripped off. We read the three papers behind the numbers against how they are cited. The famous '40% boost' is a position-adjusted word-share proxy on a 2023 GPT-3.5 testbed, and the paper's own real-engine headline quietly switches to a different metric; a multi-actor re-test found most content tactics ineffective or negative; and a verifiability audit shows only about half of AI-generated sentences are fully supported by their own citations.

This is the fifth GEO Glossary dispatch. The earlier four were about our own citation data. This one steps back to the literature the whole field quotes and asks a smaller question: when a GEO article cites a number, does the paper behind it actually say that?

The short answer is that the numbers are usually real but the conditions attached to them are usually gone. The most-quoted figure in the field, "GEO can boost visibility by up to 40%," is a real sentence from a real abstract. Almost nothing else makes the trip into the marketing with it.

The "40%" and what it measured

The "up to 40%" comes from Aggarwal et al.'s 2023 GEO paper (arXiv:2311.09735). It is in the abstract, so quoting it is not invention. The question is what the 40% is a percentage of, and several conditions ride underneath it that tend to get dropped.

First, the metric. The paper scores visibility with Position-Adjusted Word Count, or PAWC: the position-weighted share of an answer's text attributed to a source, computed over the sentences that cite it. It measures how much of one answer is attributed to you, weighted toward earlier positions, not how often you are cited across queries or whether you rank first among the sources. A PAWC gain is not a "you will be cited 40% more" figure.
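To make the distinction concrete, here is a minimal sketch of a position-adjusted word share. It is our illustration, not the paper's code: the function name, the input shape, and the exponential decay by sentence position are all assumptions chosen for the example, and the exact weighting should be checked against the paper's own definition.

```python
import math

def position_adjusted_word_share(sentences, source_id):
    """Share of an answer's words attributed to one source, weighted toward early sentences.

    sentences: list of (text, set_of_cited_source_ids), in answer order.
    The exp(-position / n_sentences) decay is an assumption made for illustration.
    """
    n = len(sentences)
    total_words = sum(len(text.split()) for text, _ in sentences)
    if total_words == 0:
        return 0.0
    weighted = 0.0
    for pos, (text, cited) in enumerate(sentences):
        if source_id in cited:
            # Only sentences that cite the source count, discounted by position.
            weighted += len(text.split()) * math.exp(-pos / n)
    return weighted / total_words

# Hypothetical three-sentence answer: A is cited first, B is cited last.
toy_answer = [
    ("Alpha beta gamma delta", {"A"}),
    ("Epsilon zeta eta theta", set()),
    ("Iota kappa lambda mu", {"B"}),
]
share_a = position_adjusted_word_share(toy_answer, "A")
share_b = position_adjusted_word_share(toy_answer, "B")
print(f"A: {share_a:.3f}  B: {share_b:.3f}")
```

In this toy answer both sources have the same plain word share (a third each), but the source cited first scores roughly twice as high once position is weighted in. Neither number says anything about how often a source is cited across queries.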

Second, the testbed, and a metric switch hiding in plain sight. The numbers come from GPT-3.5-turbo generating an answer over the top five Google results, in 2023: a constructed engine, not a deployed product. And the headline changes metric between settings. The 40% is a PAWC gain on that synthetic bench. When the paper runs the same methods against the real Perplexity.ai of the time, its headline becomes "37%", but that 37% is measured on a different metric, a model-judged "Subjective Impression" score, not PAWC. On PAWC itself, the real-engine gain is about 22%, not 40%. So anyone who lines up the famous 40% next to the Perplexity 37% and concludes the effect holds on real engines is comparing readings from two different rulers; the like-for-like number is 22%. To be fair to the paper, none of this is hidden: its Perplexity table prints the 22% and the 37% side by side, each under its own metric. The switch lives in the framing, not the data: in an abstract and introduction that foreground the larger number, and in downstream marketing that repeats a figure without its ruler.
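One way to keep the rulers straight is to never store a percentage without its metric and testbed. The sketch below is our own bookkeeping device, not anything from the paper: the Figure class and like_for_like helper are hypothetical names, and the values are the rounded figures quoted above.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Figure:
    value_pct: float   # rounded headline figure as quoted in the text
    metric: str        # the ruler the figure was measured with
    testbed: str       # the engine it was measured on

def like_for_like(a: Figure, b: Figure) -> bool:
    """Two figures are comparable only if they share a metric."""
    return a.metric == b.metric

synthetic_pawc = Figure(40.0, "PAWC", "GPT-3.5 synthetic bench")
perplexity_impression = Figure(37.0, "Subjective Impression", "Perplexity.ai")
perplexity_pawc = Figure(22.0, "PAWC", "Perplexity.ai")

# 40% vs 37% mixes metrics; 40% vs 22% is the same-metric comparison.
print(like_for_like(synthetic_pawc, perplexity_impression))
print(like_for_like(synthetic_pawc, perplexity_pawc))
```

The check is deliberately crude. It only refuses cross-metric comparisons; it does not say that two same-metric figures from different testbeds are otherwise equivalent.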

Third, the scope. The paper's Results section names three effective methods, Cite Sources, Quotation Addition, and Statistics Addition, and reports their gain as a "30-40%" range on PAWC, with the single best method at about 41%. It prints absolute position-adjusted scores (no-modification baseline 19.3; Quotation Addition 27.2; Statistics Addition 25.2; Cite Sources 24.6). It does report some per-method percentages, but only broken out by a page's starting SERP rank, in a separate table, never as one headline number per method. So when a guide hands you a clean per-method league table, those single figures are not the paper's: at best they are a reader's division of the absolute scores (27.2 over 19.3 is about +41%), and several of the ones in circulation are not even that.
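The reader's division is easy to reproduce, which is part of why derived figures circulate so freely. A minimal sketch, using only the absolute scores quoted above; the function and variable names are ours, and the outputs are our derivation, not numbers the paper prints as headlines.

```python
# Absolute position-adjusted scores as quoted in the text above.
BASELINE = 19.3
SCORES = {
    "Quotation Addition": 27.2,
    "Statistics Addition": 25.2,
    "Cite Sources": 24.6,
}

def relative_gain_pct(score: float, baseline: float) -> float:
    """Percentage gain of a score over the no-modification baseline."""
    return (score / baseline - 1.0) * 100.0

# Derived per-method gains, rounded to one decimal place.
GAINS = {name: round(relative_gain_pct(s, BASELINE), 1) for name, s in SCORES.items()}

for name, gain in GAINS.items():
    print(f"{name}: +{gain}%")
```

On these inputs Quotation Addition works out to about +41%, Statistics Addition to about +31%, and Cite Sources to about +27%.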

And the finding most often missing: not everything worked. The same nine-method experiment found Authoritative tone produced, in the paper's own words, "no significant improvement," and Keyword Stuffing scored below the no-modification baseline, a negative effect. The headline that survives into the marketing is the 40%; the two cautionary results rarely travel with it.

Read against a sample of widely shared "GEO statistics" pages, the patterns recur: the percentage quoted with no mention of the metric, the model, or the year; the proxy metric relabeled as a "citation rate"; and a precise per-method number pinned where the paper gives only a collective ceiling or a rank-conditional one (one guide credits adding statistics with a specific 41%, but in the paper the ~41% is the single best method's figure, that method is adding quotations, and adding statistics comes in nearer 31%). We keep these examples anonymous here and will name them in a longer write-up; the point is the pattern, not any one page.

The multi-actor re-test the marketing has not caught up with

There is a second, newer paper the popular GEO write-ups have not absorbed yet, a NeurIPS 2025 benchmark called C-SEO Bench (Puerto et al., arXiv:2506.11097). It re-tested the same family of content methods, but scored them by citation ranking (which source the engine actually cites) and under multi-actor conditions: not one page optimizing while the rest of the web holds still, but many pages adopting the same tactic at once.

Its verdict is blunt. From the abstract: "most current C-SEO methods are not only largely ineffective but also frequently have a negative impact on document ranking." Out of 54 method-and-domain cases, three reached statistical significance. The paper finds that simply being placed first in the model's context is, in its abstract's words, "significantly more effective" than the content tactics; on its retail task the means work out to roughly 7.7x (2.77 versus 0.36 mean rank improvement, both with wide spread, our calculation from its table). And whatever gains exist erode toward zero as more sources adopt the same method: the paper's words are "congested" and "zero-sum."
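For transparency, the arithmetic behind our two derived figures is below. The inputs are the numbers quoted above; the variable names are ours, and the ratio compares two means that both carry wide spread, so it should be read as a rough magnitude rather than a precise effect size.

```python
# Mean rank improvement on the retail task, as quoted above.
FIRST_IN_CONTEXT = 2.77   # document simply placed first in the model's context
CONTENT_TACTICS = 0.36    # content tactics

# Ratio of the two means (our calculation, not a figure the paper states).
ratio = FIRST_IN_CONTEXT / CONTENT_TACTICS

# Share of method-and-domain cases that reached statistical significance.
significant_share_pct = 3 / 54 * 100

print(f"ratio: {ratio:.1f}x, significant cases: {significant_share_pct:.1f}%")
```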

Honesty cuts both ways, so two qualifiers. Two methods (the paper's "LLM guidance" and "content improvement") did help, but only in narrow domains, so "most ineffective" is not "nothing works." And this is citation ranking under competition, a different and harder thing than the word-count share Aggarwal measured. The two papers are not in conflict; C-SEO Bench says so itself, noting the difference is "due to the metrics used" and that Aggarwal's own data, read for citation ranking, points the same way. Read together they say: content tactics are weak, contested levers, and the lever that dominated in these tests was retrieval position, not loading a page with the approved ingredients.

The citation floor underneath all of it

A third paper sets a floor the optimization conversation usually skips. Before asking how to get cited more, it is worth knowing how reliably engines cite at all. Liu, Zhang, and Liang's "Evaluating Verifiability in Generative Search Engines" (Stanford; EMNLP Findings 2023, arXiv:2304.09848) is the foundational audit. Averaged across the four engines it studied, it found that only 51.5% of generated sentences were fully supported by their citations (citation recall), and only 74.5% of citations actually supported the sentence they were attached to (citation precision). Roughly one citation in four did not fully support its claim. The authors call the result "concerningly low ... given their facade of trustworthiness."

Two honesty notes on those averages. The 51.5% recall figure is dragged down by one weak engine (YouChat, near 11%; the other three average about 65%), and precision held up better than recall across the board, so the floor is less uniform than a single number suggests. And the study is old: February and March 2023, on Bing Chat, NeevaAI, perplexity.ai, and YouChat, of which only perplexity.ai still runs under its own name (NeevaAI shut down, Bing Chat became Copilot, YouChat has been repositioned). So 51.5% / 74.5% is the seminal benchmark, not a current reading of today's engines, and no public study has re-run its exact method at the same scale. The structural lesson outlasts the numbers: being cited is not the same as being represented accurately.
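The "other three average about 65%" figure can be backed out from the two numbers quoted above. This is a sketch under two assumptions that we label as such: that the 51.5% headline is an unweighted mean across the four engines, and that YouChat's recall is taken as a rounded 11%. The helper name is ours.

```python
def mean_of_rest(overall_mean: float, n_total: int, outlier_value: float) -> float:
    """Mean of the remaining items after removing one value from an unweighted mean."""
    return (overall_mean * n_total - outlier_value) / (n_total - 1)

# Citation recall: four-engine average and the weak engine, as quoted above.
other_three_recall = mean_of_rest(51.5, 4, 11.0)

# Citation precision of 74.5% implies the share of citations lacking full support.
unsupported_citations_pct = 100.0 - 74.5

print(f"other three engines: {other_three_recall:.1f}% recall")
print(f"citations without full support: {unsupported_citations_pct:.1f}%")
```

If the paper's average is weighted by query or sentence counts rather than by engine, the backed-out figure would shift, so treat it as an approximation.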

How to read a GEO number

The audit reduces to a short checklist. When you meet a GEO statistic, before repeating it, ask:

  • What does it measure? A proxy like PAWC (attributed word share) is not a citation rate, and two proxy metrics in the same paper are not interchangeable. "Cited," "occupies more of the answer," and "scores higher on a model-judged impression" are three different outcomes.
  • On what engine, in what year? A 2023 GPT-3.5 synthetic testbed is not a 2026 deployed engine, and several engines in the foundational papers no longer exist.
  • Single-actor or multi-actor? A tactic that helps when one page uses it can vanish, or reverse, once everyone does.
  • Is the number the paper's, or a reader's arithmetic? Absolute scores are usually quotable; per-method percentages are often derived. Both are legitimate, but only one is "what the paper says."
  • What did the paper find that did not work? The null and negative results (authoritative tone, keyword stuffing) are the first thing the marketing drops and often the most useful thing to keep.

A note on our own numbers

The single most common error in this material is reading a figure off the wrong column or the wrong metric, and we were not exempt. Until we re-read the GEO paper's results table against the PDF this week, our own glossary had been citing the table's plain word-count column as the position-adjusted metric the headline is actually built on; the columns differ by a point or two per method, a small gap but exactly the kind of slip this dispatch is about. We corrected it across the glossary before publishing this. We present the papers' absolute scores as the quoted data and any per-method percentage as our own derivation. An audit of other people's sourcing has no standing if its own is loose, and the clearest example we have of the error we are flagging is one we had to fix in our own pages first.

What this is and is not

This is a reading of three papers against how they are commonly cited, not a new experiment. The misquote patterns are illustrative, drawn from a sample rather than a census of the field; the structured version, with named and dated sources, belongs in the longer write-up. The papers themselves we checked against their primary sources. None of this says GEO is fake or that content quality does not matter. It says the field's confident numbers are narrower than they sound, and that the durable levers (retrievable, self-contained, accurately citable content) are duller than the headline tactics. That is usually how it goes with real findings.
