Keyword Stuffing
Citation status
Last checked 2026-05-29
Keyword Stuffing is the flagship negative result of the Aggarwal et al. 2023 GEO paper. The paper's Section 4 prose carries the finding: "we find such methods have little to no performance improvement on Generative Engine's responses" [1], a null finding for the most widely used classical SEO tactic on generative-engine citation visibility. The Table 1 caption confirms the directional framing: "simple methods such as Keyword Stuffing traditionally used in SEO do not perform very well. However, our proposed methods such as Statistics Addition and Quotation Addition show strong performance improvements." The prose accompanying Table 5 (Perplexity.ai) goes further, stating that Keyword Stuffing "performs 10% worse than the baseline" on that specific engine. The raw Table 1 PAWC measurement on the main GEO-bench (17.8 vs the no-modification baseline of 19.5, mathematically -8.7%, the only one of the 9 tested methods to fall below baseline) is consistent with the paper's null prose; the load-bearing characterization is the verbatim prose in Section 4 and alongside Table 5, not the derived percentage. Treat the derived percentage as a transparency check on the prose, not as the headline finding.
This entry documents that finding and serves as the geo-content-methods cluster's primary counter-evidence anchor against the popular SEO claim that traditional keyword optimization transfers cleanly to generative engines. The 2025 follow-up benchmark, C-SEO Bench [2], confirms the null/negative finding under multi-actor production-realistic conditions: most tested C-SEO methods, including Keyword Stuffing, produced near-zero or slightly negative effects on document ranking in multi-domain testing. (C-SEO Bench measures document ranking, not the PAWC citation-share metric Aggarwal uses, so it is corroborating counter-evidence on the broader keyword-stuffing-does-not-transfer hypothesis rather than a direct PAWC replication.) In both testbeds this entry cites, the single-actor synthetic one (Aggarwal 2023) and the multi-actor production-realistic one (C-SEO Bench 2025), mechanically inflating keyword density in source pages did not produce a measurable improvement in the outcome each benchmark tracks.
Status in 2026
Despite the paper's null finding, keyword-density work remains widely recommended in 2026 SEO and GEO guides as a primary citation lever. The two public benchmarks this entry cites point the other way. Aggarwal 2023 describes Keyword Stuffing as having "little to no performance improvement" in its Section 4 prose on the main GEO-bench, and states in its Perplexity-specific Table 5 prose that it "performs 10% worse than the baseline"; the underlying raw PAWC measurement is consistent (17.8 vs baseline 19.5, the only one of 9 tested methods below baseline). C-SEO Bench 2025's multi-actor analysis extends the finding to a different metric (document ranking) and a different testbed shape, again with near-zero or negative effect. The folk-wisdom claim that "more keywords = more AI citation" has no empirical support in either of these two benchmarks.
The negative finding is load-bearing for understanding the difference between classical SEO and GEO. Classical SEO targets keyword-matching ranking algorithms; generative engines retrieve passages and condition LLM generation on the retrieved content. Keyword density is an input that matters more for the first kind of system than the second; the Aggarwal paper's central argument is that GEO requires different tactics, not just rebranded keyword work. The paper's choice to test Keyword Stuffing as a control is the empirical anchor for this distinction.
The negative finding does not mean that keyword research is useless. Understanding which queries an audience runs, which terms are most searched in a topic cluster, and which long-tail variations have measurable volume remains useful work for discoverability and content targeting. The Aggarwal negative result is specifically about adding keywords to existing source content to boost citation visibility, not about understanding which queries to target with content in the first place. Practitioners writing for AI search should continue researching the queries their audience runs; mechanical keyword-density inflation in source pages is the failure mode.
What the paper actually tested
The Aggarwal paper applied Keyword Stuffing as an LLM-prompted source-content modification: GPT-3.5-turbo was instructed to rewrite source pages to include more keywords from the target query. The intervention was measured against the Position-Adjusted Word Count (PAWC) metric on the GEO-bench benchmark with top-5 Google sources, temperature=0.7, 5 responses per query, in 2023.
Table 1 (main GEO-bench) PAWC values, sorted high to low:
| Method | PAWC | Mathematical relative gain |
|---|---|---|
| Quotation Addition | 27.8 | +42.6% |
| Statistics Addition | 25.9 | +32.8% |
| Fluency Optimization | 25.1 | +28.7% |
| Cite Sources | 24.9 | +27.7% |
| Technical Terms | 23.1 | +18.5% |
| Easy-to-Understand | 22.2 | +13.8% |
| Authoritative | 21.8 | +11.8% (paper-verbatim null) |
| Unique Words | 20.7 | +6.2% |
| Baseline | 19.5 | (reference) |
| Keyword Stuffing | 17.8 | -8.7% |
The mathematically derived relative gains above are computed directly from the Table 1 PAWC values; the paper itself frames its headline more conservatively as "up to 40%" and names a verbatim top-3 (Cite Sources, Quotation Addition, Statistics Addition) at a "30-40% relative improvement" range. Cite Sources appears in the named top-3 for combined-method strength rather than for standalone PAWC ranking (it is 4th standalone). Per-engine results vary: Table 5 (Perplexity.ai) reports a different baseline of 24.0 and a best method at +22%, not the main bench's 30-40% range.
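The derivation of the third column can be sketched in a few lines of Python. This is a minimal illustration that assumes only the PAWC values quoted in the table above; the names `TABLE_1_PAWC`, `BASELINE_PAWC`, and `relative_gain` are hypothetical helpers introduced here, not anything from the paper or its code release.

```python
# Table 1 PAWC values as quoted in this entry (main GEO-bench).
# All names below are illustrative, not from the paper.
BASELINE_PAWC = 19.5

TABLE_1_PAWC = {
    "Quotation Addition": 27.8,
    "Statistics Addition": 25.9,
    "Fluency Optimization": 25.1,
    "Cite Sources": 24.9,
    "Technical Terms": 23.1,
    "Easy-to-Understand": 22.2,
    "Authoritative": 21.8,
    "Unique Words": 20.7,
    "Keyword Stuffing": 17.8,
}


def relative_gain(pawc: float, baseline: float = BASELINE_PAWC) -> float:
    """Percentage change of a method's PAWC relative to the baseline."""
    return (pawc - baseline) / baseline * 100.0


# Derived relative gain per method, rounded to one decimal as in the table.
gains = {method: round(relative_gain(pawc), 1) for method, pawc in TABLE_1_PAWC.items()}

# Methods whose derived gain is negative, i.e. that score below baseline.
below_baseline = [method for method, gain in gains.items() if gain < 0]

if __name__ == "__main__":
    for method, gain in sorted(gains.items(), key=lambda kv: kv[1], reverse=True):
        print(f"{method:<22} {gain:+.1f}%")
    print("Below baseline:", below_baseline)
```

Run as a script, this lists the methods from highest to lowest derived gain, and `below_baseline` contains only Keyword Stuffing. The percentages are arithmetic on the quoted table values and say nothing about statistical significance.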
For Keyword Stuffing specifically, the paper's wording strengthens from one engine to the next. The Section 4 prose on the main GEO-bench (Table 1) is "we find such methods have little to no performance improvement", a null framing consistent with the -8.7% derived from Table 1. The Perplexity-specific prose accompanying Table 5 goes further: "our observations such as the ineffectiveness of traditional methods used in SEO such as Keyword Stuffing are further highlighted, as it performs 10% worse than the baseline." The raw Table 5 Keyword Stuffing PAWC on Perplexity is 21.9 against a Perplexity baseline of 24.0, a derived decline of roughly 8.75%, which confirms the direction of the prose even though it sits closer to the main-bench figure than to the stated 10%. Both prose framings (null on main bench, actively worse on Perplexity) point in the same direction. Taken together, Keyword Stuffing is the only method the paper consistently characterizes as either unhelpful or counterproductive across both tables.
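As a rough check on the two Keyword Stuffing figures, the relative-change arithmetic can be applied per engine. The sketch below assumes only the four PAWC values quoted above; `relative_change` and `KEYWORD_STUFFING_PAWC` are illustrative names, not taken from the paper.

```python
def relative_change(value: float, baseline: float) -> float:
    """Percentage change of value relative to baseline."""
    return (value - baseline) / baseline * 100.0


# (Keyword Stuffing PAWC, baseline PAWC) per engine, as quoted in this entry.
KEYWORD_STUFFING_PAWC = {
    "GEO-bench (Table 1)": (17.8, 19.5),
    "Perplexity.ai (Table 5)": (21.9, 24.0),
}

changes = {
    engine: relative_change(value, baseline)
    for engine, (value, baseline) in KEYWORD_STUFFING_PAWC.items()
}

if __name__ == "__main__":
    for engine, change in changes.items():
        print(f"{engine}: {change:+.2f}%")
```

Both derived values are negative. The Perplexity figure comes out near -8.75% rather than exactly the 10% in the paper's prose; this sketch does not attempt to reconstruct how the paper arrived at its number, which is one reason this entry treats the prose, not the derived percentage, as the load-bearing characterization.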
The paper does not report p-values or statistical significance tests for any of the per-method gains; the prose framing is the load-bearing characterization in both directions (top methods explicitly called out as effective, Keyword Stuffing explicitly called out as not).
How to apply
The practical takeaway for content-writing programs:
- Do not pad source content with query keywords expecting AI citation lift. The paper benchmark and the C-SEO Bench follow-up agree: mechanical keyword inflation does not produce measurable lift in any tested condition. The effort is better spent on the paper's verbatim named top-3 of effective methods (Cite Sources, Quotation Addition, Statistics Addition) plus Fluency Optimization (3rd by standalone Table 1 PAWC, strongest in the Fluency-plus-Statistics combination experiment).
- Keyword research stays useful for discoverability and audience targeting. Understanding the query distribution your audience runs, the long-tail variations they use, and the competitive landscape on those queries remains valuable; the negative finding is specifically about mechanically inflating density in already-written content, not about query selection.
- Treat "sophisticated keyword variation produces better results than crude stuffing" as untested. The paper measured "Keyword Stuffing" as adding more query-relevant keywords; it did not separately test LSI (Latent Semantic Indexing) keywords, semantic keyword expansion, or long-tail variation as distinct interventions. Practitioner claims that "smart" keyword work produces different results from "crude" stuffing are speculation rather than benchmark-derived finding.
- Use the negative finding as cluster discipline. When a content marketing source claims "keyword optimization is the foundation of GEO," the Aggarwal benchmark is the primary counter-evidence anchor: paper-verbatim null on the largest single SEO tactic.
What to skip:
- "Updated for AI search" keyword-density tools that promise specific keyword counts as targets. The paper measured one Keyword Stuffing intervention; specific density targets are not paper-derived.
- Conflating keyword research (useful) with keyword stuffing (paper-verbatim null). The two are different activities; the negative finding is about the latter only.
How it relates to other concepts
- Counterpoint to the paper's verbatim named top-3 Aggarwal methods (Cite Sources, Quotation Addition, and Statistics Addition, at a stated 30-40% relative improvement range): these are the methods the paper actively recommends; Keyword Stuffing is the method the paper documents as not working. The paper's combined-method analysis additionally highlights Fluency Optimization paired with Statistics Addition as the strongest combination (+5.5% over any single method, §5.3). The two-side framing (named-top-3 effective vs Keyword Stuffing's negative result) is the paper's central argument that GEO is structurally different from classical SEO.
- Paired with Authoritative Statement Strength as the paper's two null-or-negative methods: Authoritative tone is paper-verbatim "no significant improvement" (raw +11.8% but framed as null); Keyword Stuffing is paper-verbatim "little to no performance improvement" with the raw number actually below baseline. Both are widely recommended SEO tactics that the paper measured and found not to transfer.
- Confirmed by C-SEO Bench 2025: the multi-actor production-realistic benchmark tested 7 of Aggarwal's 9 methods (including Keyword Stuffing in some form) and found most C-SEO methods, including the keyword-density family, produced near-zero or slightly negative effects on document ranking.
- Distinct from Statistical Density: Statistics Addition (adding sourced statistics) is one of the paper's top methods at +33% relative gain; Keyword Stuffing (adding query-keywords) is the paper's negative result at -8.7%. The two interventions are surface-similar (both modify source content) but produce opposite measured effects.
- Useful counter-anchor for cite-ability discipline: cite-ability emphasizes self-contained claims with attribution; keyword stuffing produces content that may be longer and more keyword-dense without the substantive content features (sourced quotations, statistics, citations, fluency) that the paper measures as effective.
Footnotes
1. Aggarwal et al. "GEO: Generative Engine Optimization." arXiv:2311.09735, November 2023 (KDD 2024). Princeton + IIT Delhi + Georgia Tech + Allen Institute for AI. Tests 9 LLM-prompted content-modification methods against a Position-Adjusted Word Count (PAWC) metric on the GEO-bench benchmark. Table 1 main GEO-bench PAWC values: Quotation Addition 27.8, Statistics Addition 25.9, Fluency Optimization 25.1, Cite Sources 24.9, Technical Terms 23.1, Easy-to-Understand 22.2, Authoritative 21.8, Unique Words 20.7, no-modification baseline 19.5, Keyword Stuffing 17.8. The paper's verbatim Results section names a top-3 (Cite Sources, Quotation Addition, Statistics Addition) with a "30-40% relative improvement" range. Per-engine results vary: Table 5 (Perplexity.ai) reports a different baseline of 24.0 and the best method at +22%, not the main bench's 30-40% range. For Keyword Stuffing specifically: paper Section 4 verbatim: "we also evaluate the idea of using keyword stuffing, i.e. adding more relevant keywords to the website content. While this technique has been widely used for Search Engine Optimization, we find such methods have little to no performance improvement on Generative Engine's responses." Table 1 caption verbatim: "Performance improvement of GEO methods on GEO-bench... Compared to the baselines simple methods such as Keyword Stuffing traditionally used in SEO do not perform very well. However, our proposed methods such as Statistics Addition and Quotation Addition show strong performance improvements across all metrics considered." Raw measurement: PAWC 17.8 vs no-modification baseline of 19.5, the only one of the 9 methods to score BELOW baseline (mathematically -8.7%). The paper does not report p-values or significance tests; the prose framing is the load-bearing characterization. Testbed: GPT-3.5-turbo, top-5 Google sources, temperature=0.7, 5 responses per query, 2023.
   Primary-source re-verified 2026-05-30 against the ar5iv HTML mirror of arXiv:2311.09735: all Table 1 PAWC values, Table 1 caption verbatim, Section 4 prose, the verbatim named top-3 quote, and Table 5 Perplexity.ai per-engine numbers (including Keyword Stuffing 21.9 with paper prose 'performs 10% worse than the baseline') confirmed.
2. Puerto, Gubri, Green, Oh, Yun. "C-SEO Bench: Does Conversational SEO Work?" arXiv:2506.11097, submitted June 2025; accepted at NeurIPS Datasets & Benchmarks 2025. Tested 7 of Aggarwal et al. 2023's 9 methods plus 2 novel methods across question-answering and product-recommendation tasks on multiple domains and varying multi-actor adoption rates. Verbatim conclusion: "most current C-SEO methods are not only largely ineffective but also frequently have a negative impact on document ranking ... as we increase the number of C-SEO adopters, the overall gains decrease, depicting a congested and zero-sum nature of the problem." For Keyword Stuffing specifically: confirms the Aggarwal negative finding under multi-actor production-realistic conditions; the keyword-density family produced near-zero or slightly negative effects on document ranking. Note on relationship to Aggarwal: C-SEO Bench's primary outcome metric is document ranking in retrieval results, not the Position-Adjusted Word Count (PAWC) citation-share metric Aggarwal measures; the two benchmarks therefore provide corroborating evidence on the broader "keyword-density work does not transfer to generative engines" hypothesis rather than a direct PAWC replication of Aggarwal's specific testbed.
Related terms
- [Quotation Addition](/terms/quotation-addition)
- [Cite Sources Optimization](/terms/cite-sources-optimization)
- [Fluency Optimization](/terms/fluency-optimization)
- [Statistical Density](/terms/statistical-density)
- [Authoritative Statement Strength](/terms/authoritative-statement-strength)
- [Definition-Lead Style](/terms/definition-lead-style)
- [Cite-ability](/terms/cite-ability)
- [Passage-level optimization](/terms/passage-level-optimization)
- [Generative Engine Optimization](/terms/generative-engine-optimization)
FAQ
- What is Keyword Stuffing in the Aggarwal GEO paper?
- Keyword Stuffing is one of nine LLM-prompted content-modification methods tested in Aggarwal et al. 2023 (arXiv:2311.09735): the method rewrites source content to include more query-relevant keywords, the classical SEO optimization tactic. In the paper's evaluation against the Position-Adjusted Word Count (PAWC) metric, Keyword Stuffing scored PAWC 17.8 vs the no-modification baseline of 19.5, the only method of the 9 to score BELOW baseline (-8.7%). The paper's verbatim characterization in Section 4: 'we find such methods have little to no performance improvement on Generative Engine's responses.'
- Does adding more keywords help AI citation?
- No, per the Aggarwal et al. 2023 benchmark. The paper specifically tested this hypothesis (classical SEO tactic) and reported verbatim 'little to no performance improvement.' The 2025 C-SEO Bench follow-up confirms the negative finding: under multi-actor production-realistic conditions, most C-SEO methods including Keyword Stuffing produced near-zero or negative effects on document ranking. The popular SEO claim that keyword-density optimization transfers to generative engines has no empirical support in either public benchmark.
- Why does the paper test something it expected would fail?
- Aggarwal et al. test Keyword Stuffing precisely because it is the most widely used SEO tactic and a natural null hypothesis. The Table 1 caption frames the result explicitly: 'simple methods such as Keyword Stuffing traditionally used in SEO do not perform very well. However, our proposed methods such as Statistics Addition and Quotation Addition show strong performance improvements.' The negative result is load-bearing for the paper's central argument that generative engine optimization requires different tactics than classical SEO, not just rebranded keyword work. Documenting which methods do NOT transfer is as important as documenting which do.
- What about long-tail keyword variation or LSI keywords specifically?
- The Aggarwal paper did not separately measure long-tail keyword variation, LSI (Latent Semantic Indexing) keywords, or semantic keyword expansion as distinct interventions; it tested 'Keyword Stuffing' as adding more query-relevant keywords to source content. Whether more refined keyword variations (semantic neighbors, query expansion, related-search terms) produce different results is not measured by the public benchmarks. Practitioners should treat any 'sophisticated keyword variation produces better results than crude stuffing' claim as untested hypothesis rather than paper-derived finding.
- Is keyword research still relevant for AI search at all?
- For discoverability work (understanding what your audience searches for) and for query targeting, keyword research remains useful. The Aggarwal negative result is specifically about adding keywords to existing content to boost citation visibility, not about understanding which queries to target with content. Practitioners writing for AI search should continue researching the queries their audience runs; the negative finding only says that mechanically inflating keyword density in source pages does not improve generative engine citation under the paper's testbed conditions.