Keyword Stuffing

Keyword Stuffing is the Aggarwal et al. 2023 GEO paper's flagship negative result: the paper tested rewriting source content to include more query-relevant keywords (the traditional SEO tactic) and characterized the result verbatim as 'little to no performance improvement on Generative Engine's responses' in Section 4. The Table 1 main GEO-bench raw PAWC measurement (17.7 vs baseline 19.3, mathematically -8%) is consistent with the null prose; the Table 5 Perplexity.ai prose escalates further, characterizing Keyword Stuffing as performing 10% worse than the Perplexity baseline. This entry documents the paper finding and the 2025 C-SEO Bench follow-up that confirms the null/negative result under multi-actor production-realistic conditions.

Citation status

ChatGPT 0× · Perplexity 0× · Claude 0× · Copilot · Gemini 0×

Last checked 2026-07-13

Keyword Stuffing is the Aggarwal et al. 2023 GEO paper's flagship negative result. The characterization this entry relies on is the paper's own Section 4 wording: "we find such methods have little to no performance improvement on Generative Engine's responses"¹, a null finding for the most widely used classical SEO tactic on generative-engine citation visibility. The Table 1 caption confirms the directional framing: "simple methods such as Keyword Stuffing traditionally used in SEO do not perform very well. However, our proposed methods such as Statistics Addition and Quotation Addition show strong performance improvements." The prose accompanying Table 5 (Perplexity.ai) goes further, stating that Keyword Stuffing "performs 10% worse than the baseline" on that specific engine. The raw Table 1 measurement on the main GEO-bench (PAWC 17.7 vs the no-modification baseline of 19.3, about 8% lower, and the only one of the 9 tested methods to fall below baseline) is consistent with the paper's null prose. The verbatim prose in Section 4 and alongside Table 5, not the derived percentage, is what this entry leans on: treat the raw percentage as a transparency check on the prose, not as the headline finding.

This entry documents that finding and is the geo-content-methods cluster's primary counter-evidence anchor against the popular SEO claim that traditional keyword optimization transfers cleanly to generative engines. The 2025 follow-up benchmark, C-SEO Bench², points the same way under multi-actor production-realistic conditions: most tested C-SEO methods produced near-zero or slightly negative effects on citation ranking in multi-domain testing. (C-SEO Bench measures citation ranking, not the PAWC citation-share metric Aggarwal uses, and it does not directly retest Aggarwal's Keyword Stuffing method, so it is corroborating counter-evidence on the broader keyword-stuffing-does-not-transfer hypothesis rather than a direct PAWC replication.) Neither the single-actor synthetic testbed (Aggarwal 2023) nor the multi-actor production-realistic testbed (C-SEO Bench 2025) gives any evidence that mechanically inflating keyword density in source pages improves generative engine citation.

Status in 2026

Despite the paper-verbatim null finding, keyword-density work remains widely recommended in 2026 SEO and GEO guides as a primary citation lever. The empirical picture from the two public benchmarks this entry cites is the opposite. Aggarwal 2023 characterizes Keyword Stuffing verbatim as "little to no performance improvement" in Section 4 main GEO-bench prose and escalates to "performs 10% worse than the baseline" in Table 5 Perplexity-specific prose; the underlying raw PAWC measurement is consistent (17.7 vs baseline 19.3, the only one of 9 tested methods below baseline). C-SEO Bench 2025's multi-actor analysis extends the finding to a different metric (citation ranking) and a different testbed shape, again with near-zero or negative effect. The popular folk-wisdom claim that "more keywords = more AI citation" has no empirical support in either of the public benchmarks measured to date.

The negative finding is load-bearing for understanding the difference between classical SEO and GEO. Classical SEO targets keyword-matching ranking algorithms; generative engines retrieve passages and condition LLM generation on the retrieved content. Keyword density is an input that matters more for the first kind of system than the second; the Aggarwal paper's central argument is that GEO requires different tactics, not just rebranded keyword work. The paper's choice to test Keyword Stuffing as a control is the empirical anchor for this distinction.

What the negative finding does not mean: it does not mean keyword research is useless. Understanding which queries an audience runs, which terms are most-searched in a topic cluster, and which long-tail variations have measurable volume remains useful work for discoverability and content targeting. The Aggarwal negative result is specifically about adding keywords to existing source content to boost citation visibility, not about understanding which queries to target with content in the first place. Practitioners writing for AI search should continue researching the queries their audience runs; mechanical keyword-density inflation in source pages is the failure mode.

What the paper actually tested

The Aggarwal paper applied Keyword Stuffing as an LLM-prompted source-content modification: GPT-3.5-turbo was instructed to rewrite source pages to include more keywords from the target query. The intervention was measured against the Position-Adjusted Word Count (PAWC) metric on the GEO-bench benchmark with top-5 Google sources, temperature=0.7, 5 responses per query, in 2023.
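This entry does not reproduce the paper's PAWC formula, so the sketch below only illustrates the general idea of a position-weighted word-count share: each cited sentence contributes its word count, down-weighted the later it appears in the response. The function name, the exponential decay `exp(-pos / n)`, and the equal split of credit among co-cited sources are assumptions made for illustration; check them against the paper before reusing the code.

```python
import math
from collections import defaultdict


def position_adjusted_word_share(sentences):
    """Illustrative position-weighted citation share (hypothetical helper).

    sentences: list of (text, [source_ids]) tuples in response order.
    Returns {source_id: share of the response's words, position-weighted}.
    """
    n = len(sentences)
    total_words = sum(len(text.split()) for text, _ in sentences)
    if total_words == 0:
        return {}

    shares = defaultdict(float)
    for pos, (text, sources) in enumerate(sentences):
        if not sources:
            continue  # uncited sentences give no credit to any source
        weight = math.exp(-pos / n)  # assumed decay: earlier sentences count more
        credit = len(text.split()) * weight / len(sources)  # assumed equal split
        for src in sources:
            shares[src] += credit / total_words
    return dict(shares)


# Toy response: source "a" is cited first, source "b" second.
response = [
    ("Keyword stuffing adds query terms to a page.", ["a"]),
    ("Benchmarks report little benefit from it.", ["b"]),
]
shares = position_adjusted_word_share(response)
```

In the toy response, source "a" gets full credit for its 8 words because it is cited first, while the 6 words credited to "b" are discounted for appearing later.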

Table 1 (main GEO-bench) PAWC values, sorted high to low:

Method	PAWC (position-adjusted, "Overall" column)	Relative gain vs 19.3
Quotation Addition	27.2	+41%
Statistics Addition	25.2	+31%
Fluency Optimization	24.7	+28%
Cite Sources	24.6	+27%
Technical Terms	22.7	+18%
Easy-to-Understand	22.0	+14%
Authoritative	21.3	+10% (paper-verbatim null)
Unique Words	20.5	+6%
Baseline	19.3	(reference)
Keyword Stuffing	17.7	-8%

The values above are Table 1's position-adjusted "Overall" sub-column (the un-adjusted plain "Word" sub-column reads 27.8 / 25.9 / 25.1 / 24.9 / 23.1 / 22.2 / 21.8 / 20.7 / 19.5 / 17.8, which earlier versions of this entry cited as "PAWC" in error); the relative gains are computed directly from the Overall values. The paper itself frames its headline more conservatively as "up to 40%" and names a verbatim top-3 (Cite Sources, Quotation Addition, Statistics Addition) at a "30-40% relative improvement" range. Cite Sources appears in the named top-3 for combined-method strength rather than for standalone PAWC ranking (it is 4th standalone). Per-engine results vary: Table 5 (Perplexity.ai) reports a different baseline of 24.0 and a best method at +22%, not the main bench's 30-40% range.
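The relative-gain column can be re-derived from the Overall values with a few lines of arithmetic. The sketch below uses only the numbers listed in this entry; the variable and function names are illustrative, and rounding to whole percentages is an assumption that matches how the column is presented.

```python
# Table 1 position-adjusted "Overall" values as listed in this entry.
OVERALL = {
    "Quotation Addition": 27.2,
    "Statistics Addition": 25.2,
    "Fluency Optimization": 24.7,
    "Cite Sources": 24.6,
    "Technical Terms": 22.7,
    "Easy-to-Understand": 22.0,
    "Authoritative": 21.3,
    "Unique Words": 20.5,
    "Keyword Stuffing": 17.7,
}
BASELINE = 19.3  # no-modification baseline, same column


def relative_gain_pct(score, baseline=BASELINE):
    """Relative change vs baseline, rounded to a whole percent."""
    return round((score - baseline) / baseline * 100)


gains = {method: relative_gain_pct(score) for method, score in OVERALL.items()}
below_baseline = [m for m, score in OVERALL.items() if score < BASELINE]

for method, gain in gains.items():
    print(f"{method}: {gain:+d}%")
```

These are derived figures, not numbers the paper reports; the paper's own headline framing remains "up to 40%" and the "30-40% relative improvement" range.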

For Keyword Stuffing specifically, the paper's prose graduates with engine. The main GEO-bench (Table 1) prose in Section 4 is "we find such methods have little to no performance improvement", a null framing consistent with the Table 1 raw -8% (position-adjusted "Overall" column). The Perplexity-specific prose accompanying Table 5 escalates: "our observations such as the ineffectiveness of traditional methods used in SEO such as Keyword Stuffing are further highlighted, as it performs 10% worse than the baseline." The Table 5 raw Keyword Stuffing PAWC on Perplexity is 21.9 (vs the Perplexity baseline of 24.0), confirming the directional escalation. Both prose framings (null on main bench, actively worse on Perplexity) point in the same direction. The combined paper-verbatim picture is that Keyword Stuffing is the only method the paper characterizes consistently as either non-helpful or counter-productive across both tables.
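As a transparency check on the raw Table 5 numbers quoted above, the relative change can be worked out directly. This is plain arithmetic on the two values in this entry, not a figure the paper reports:

```latex
\[
\frac{21.9 - 24.0}{24.0} \;=\; \frac{-2.1}{24.0} \;=\; -0.0875 \;\approx\; -8.8\%
\]
```

The derived figure points in the same direction as the paper's "10% worse" prose. As with Table 1, the prose is the characterization this entry relies on, and the derived percentage is only a check on it.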

The paper does not report p-values or statistical significance tests for any of the per-method gains; the prose framing is the load-bearing characterization in both directions (top methods explicitly called out as effective, Keyword Stuffing explicitly called out as not).

How to apply

The practical takeaway for content-writing programs:

Do not pad source content with query keywords expecting AI citation lift. Mechanical keyword inflation produced no measurable lift in any condition the Aggarwal paper tested, and the C-SEO Bench follow-up found near-zero effects for most of the methods it retested. The effort is better spent on the paper's verbatim named top-3 of effective methods (Cite Sources, Quotation Addition, Statistics Addition) plus Fluency Optimization (3rd by standalone Table 1 PAWC, strongest in the Fluency-plus-Statistics combination experiment).
Keyword research stays useful for discoverability and audience targeting. Understanding the query distribution your audience runs, the long-tail variations they use, and the competitive landscape on those queries remains valuable; the negative finding is specifically about mechanically inflating density in already-written content, not about query selection.
Treat "sophisticated keyword variation produces better results than crude stuffing" as untested. The paper measured "Keyword Stuffing" as adding more query-relevant keywords; it did not separately test LSI (Latent Semantic Indexing) keywords, semantic keyword expansion, or long-tail variation as distinct interventions. Practitioner claims that "smart" keyword work produces different results from "crude" stuffing are speculation rather than benchmark-derived finding.
Use the negative finding as cluster discipline. When a content marketing source claims "keyword optimization is the foundation of GEO," the Aggarwal benchmark is the primary counter-evidence anchor: paper-verbatim null on the largest single SEO tactic.

What to skip:

"Updated for AI search" keyword-density tools that promise specific keyword counts as targets. The paper measured one Keyword Stuffing intervention; specific density targets are not paper-derived.
Conflating keyword research (useful) with keyword stuffing (paper-verbatim null). The two are different activities; the negative finding is about the latter only.

How it relates to other concepts

Counterpoint to the paper's verbatim named top-3 Aggarwal methods (Cite Sources, Quotation Addition, and Statistics Addition, at a stated 30-40% relative improvement range): these are the methods the paper actively recommends; Keyword Stuffing is the method the paper documents as not working. The paper's combined-method analysis additionally highlights Fluency Optimization paired with Statistics Addition as the strongest combination (+5.5% over any single method, §5.3). The two-side framing (named-top-3 effective vs Keyword Stuffing's negative result) is the paper's central argument that GEO is structurally different from classical SEO.
Paired with Authoritative Statement Strength as the paper's two null-or-negative methods: Authoritative tone is paper-verbatim "no significant improvement" (raw +10% but framed as null); Keyword Stuffing is paper-verbatim "little to no performance improvement" with the raw number actually below baseline. Both are widely-recommended SEO tactics that the paper measured and found not to transfer.
Reinforced by C-SEO Bench 2025: the multi-actor benchmark directly tested 7 of Aggarwal's 9 methods (under the paper's own labels: Authoritative, Statistics, Citations, Fluency, Unique Words, Simple Language, Quotes) and found most C-SEO methods produced near-zero effects on citation ranking. C-SEO Bench does not directly retest Aggarwal's Keyword Stuffing method, but the broader null result extends the negative-result territory to methods Aggarwal had reported as positive.
Distinct from Statistical Density: Statistics Addition (adding sourced statistics) is one of the paper's top methods at +31% relative gain; Keyword Stuffing (adding query-keywords) is the paper's negative result at -8%. The two interventions are surface-similar (both modify source content) but produce opposite measured effects.
Useful counter-anchor for cite-ability discipline: cite-ability emphasizes self-contained claims with attribution; keyword stuffing produces content that may be longer and more keyword-dense without the substantive content features (sourced quotations, statistics, citations, fluency) that the paper measures as effective.

¹ Aggarwal et al. "GEO: Generative Engine Optimization." arXiv:2311.09735, November 2023 (KDD 2024). Princeton + IIT Delhi + Georgia Tech + Allen Institute for AI. Tests 9 LLM-prompted content-modification methods against a Position-Adjusted Word Count (PAWC) metric on the GEO-bench benchmark. Table 1 position-adjusted PAWC values (the "Overall" sub-column, which is the metric the headline gains are computed on): Quotation Addition 27.2, Statistics Addition 25.2, Fluency Optimization 24.7, Cite Sources 24.6, Technical Terms 22.7, Easy-to-Understand 22.0, Authoritative 21.3, Unique Words 20.5, no-modification baseline 19.3, Keyword Stuffing 17.7. (Table 1 nests three sub-columns under "Position-Adjusted Word Count": Word / Position / Overall; the un-adjusted plain Word sub-column reads 27.8 / 25.9 / 25.1 / 24.9 / 23.1 / 22.2 / 21.8 / 20.7 / 19.5 / 17.8, which earlier versions cited as "PAWC" in error.) The paper's verbatim Results section names a top-3 (Cite Sources, Quotation Addition, Statistics Addition) with a "30-40% relative improvement" range. Per-engine results vary: Table 5 (Perplexity.ai) reports a different baseline of 24.0 and the best method at +22%, not the main bench's 30-40% range. For Keyword Stuffing specifically: paper Section 4 verbatim: "we also evaluate the idea of using keyword stuffing, i.e. adding more relevant keywords to the website content. While this technique has been widely used for Search Engine Optimization, we find such methods have little to no performance improvement on Generative Engine's responses." Table 1 caption verbatim: "Performance improvement of GEO methods on GEO-bench... Compared to the baselines simple methods such as Keyword Stuffing traditionally used in SEO do not perform very well. However, our proposed methods such as Statistics Addition and Quotation Addition show strong performance improvements across all metrics considered." Raw measurement: PAWC 17.7 vs no-modification baseline of 19.3 (position-adjusted "Overall" column), the only one of the 9 methods to score BELOW baseline (mathematically -8%). The paper does not report p-values or significance tests; the prose framing is the load-bearing characterization. Testbed: GPT-3.5-turbo, top-5 Google sources, temperature=0.7, 5 responses per query, 2023. Primary-source re-verified 2026-05-30 against the ar5iv HTML mirror of arXiv:2311.09735: all Table 1 PAWC values, Table 1 caption verbatim, Section 4 prose, the verbatim named top-3 quote, and Table 5 Perplexity.ai per-engine numbers (including Keyword Stuffing 21.9 with paper prose 'performs 10% worse than the baseline') confirmed.
² See the C-SEO Bench glossary entry for the full paper attribution (Puerto, Gubri, Green, Oh, Yun. "C-SEO Bench: Does Conversational SEO Work?" arXiv:2506.11097, NeurIPS 2025 Datasets & Benchmarks Track), method-by-method results, multi-actor evaluation methodology, and the full verbatim findings.

FAQ

What is Keyword Stuffing in the Aggarwal GEO paper?: Keyword Stuffing is one of nine LLM-prompted content-modification methods tested in Aggarwal et al. 2023 (arXiv:2311.09735): the method rewrites source content to include more query-relevant keywords, the classical SEO optimization tactic. In the paper's evaluation against the Position-Adjusted Word Count (PAWC) metric, Keyword Stuffing scored PAWC 17.7 vs the no-modification baseline of 19.3, the only method of the 9 to score BELOW baseline (-8%). The paper's verbatim characterization in Section 4: 'we find such methods have little to no performance improvement on Generative Engine's responses.'
Does adding more keywords help AI citation?: No, per the Aggarwal et al. 2023 benchmark. The paper specifically tested this hypothesis (a classical SEO tactic) and reported, verbatim, 'little to no performance improvement.' The 2025 C-SEO Bench follow-up is consistent with the negative finding: under multi-actor production-realistic conditions, most of the C-SEO methods it tested produced near-zero or negative effects on citation ranking, although it did not directly retest Keyword Stuffing itself. The popular SEO claim that keyword-density optimization transfers to generative engines has no empirical support in either public benchmark.
Why does the paper test something it expected would fail?: Aggarwal et al. test Keyword Stuffing precisely because it is the most widely-used SEO tactic and a natural null hypothesis. The Table 1 caption frames the result explicitly: 'simple methods such as Keyword Stuffing traditionally used in SEO do not perform very well. However, our proposed methods such as Statistics Addition and Quotation Addition show strong performance improvements.' The negative result is load-bearing for the paper's central argument that generative engine optimization requires different tactics than classical SEO, not just rebranded keyword work. Documenting which methods do NOT transfer is as important as documenting which do.
What about long-tail keyword variation or LSI keywords specifically?: The Aggarwal paper did not separately measure long-tail keyword variation, LSI (Latent Semantic Indexing) keywords, or semantic keyword expansion as distinct interventions; it tested 'Keyword Stuffing' as adding more query-relevant keywords to source content. Whether more refined keyword variations (semantic neighbors, query expansion, related-search terms) produce different results is not measured by the public benchmarks. Practitioners should treat any 'sophisticated keyword variation produces better results than crude stuffing' claim as untested hypothesis rather than paper-derived finding.
Is keyword research still relevant for AI search at all?: For discoverability work (understanding what your audience searches for) and for query targeting, keyword research remains useful. The Aggarwal negative result is specifically about adding keywords to existing content to boost citation visibility, not about understanding which queries to target with content. Practitioners writing for AI search should continue researching the queries their audience runs; the negative finding only says that mechanically inflating keyword density in source pages does not improve generative engine citation under the paper's testbed conditions.

Last fact-checked 2026-05-30. Spotted an error or stale claim? See editorial methodology.

Changelog (6 entries)

2026-07-13: Microsoft Copilot now cites this entry, placing it second in its sources and reproducing our negative-result framing from the Aggarwal et al. study (keyword stuffing performed 8 to 10 percent worse than baseline, the only method to fall below it). One of five tested engines cites it directly; ChatGPT, Perplexity, Claude, and Gemini did not.
2026-06-21: Corrected the Aggarwal Table 1 figures: the values previously cited as PAWC (Keyword Stuffing 17.8 vs baseline 19.5, and the rest) were the paper's plain Word Count sub-column. Updated to the paper's actual position-adjusted Word Count (the 'Overall' column: Keyword Stuffing 17.7 vs baseline 19.3, mathematically about -8%), which is the metric the paper's headline gains are computed on. The negative-result finding is unchanged: Keyword Stuffing is still the only method scoring below baseline, and the paper's verbatim 'little to no performance improvement' framing stands.
2026-05-30: Epistemic re-emphasis + cluster-wide PAWC primary-source re-verification. Description, lede, and Status now lead with paper-verbatim 'little to no performance improvement' (Section 4) and 'performs 10% worse than the baseline' (Table 5 Perplexity prose); the raw -8.7% / PAWC 17.8 / below-baseline framing is subordinated as a transparency check. Mirrors the pattern applied to authoritative-statement-strength. Body adds Perplexity Table 5 prose escalation (KS 21.9 vs baseline 24.0). Aggarwal footnote across 6 anchor entries appended with primary-source re-verification note vs the ar5iv mirror of arXiv:2311.09735.
2026-05-30: Cross-benchmark scoping polish (same-day after PAWC sweep). Added 'under the tested public benchmarks' qualifier so the conclusion does not over-generalize beyond what Aggarwal 2023 and C-SEO Bench 2025 directly measured. Replaced 'the only two public benchmarks' with 'the two public benchmarks this entry cites' to keep the framing time-bounded as new benchmarks appear. Body and C-SEO Bench footnote now make explicit that C-SEO Bench measures citation ranking, not Aggarwal's PAWC citation-share metric, making it corroborating counter-evidence rather than a direct PAWC replication.
2026-05-30: PAWC labeling sweep (same-day after publish). Inline Table 1 now labeled 'main GEO-bench' and frames relative gains as 'mathematically derived' rather than implying they are the paper's headline numbers; the paper itself frames its top-3 (Cite Sources, Quotation, Statistics) at 30-40%. Cluster's prior 'four top-performing levers' framing replaced with the paper's verbatim named top-3 in How-to-apply and How-it-relates. Fluency clarified as 3rd standalone but not in the paper's named top-3 (strongest in combined-method experiment instead). Footnote adds Table 5 per-engine caveat.
2026-05-29: Initial publish. Documents the Aggarwal et al. 2023 GEO paper's flagship negative result: Keyword Stuffing scored PAWC 17.8 vs baseline 19.5 (NEGATIVE 8.7%), the only one of 9 tested methods to fall below baseline. Paper verbatim: 'little to no performance improvement on Generative Engine's responses.' Joins geo-content-methods cluster as 6th Aggarwal method covered and the cluster's only negative-result entry. C-SEO Bench 2025 confirms the null/negative finding under multi-actor production-realistic conditions. Primary counter-evidence anchor against the SEO claim that keyword optimization transfers to generative engines.