C-SEO Bench

C-SEO Bench is the Puerto et al. 2025 NeurIPS Datasets & Benchmarks paper that evaluates 9 Conversational Search Engine Optimization methods across 6 domains, two tasks (question answering + product recommendation), and continuous multi-actor adoption rates. Its headline finding is that most current C-SEO methods are largely ineffective once tested outside the single-actor synthetic conditions of prior GEO benchmarks; a traditional retrieval-ranking SEO baseline (moving the source to context position 1) is roughly 7.6× more effective in their retail-domain measurement than the best C-SEO method tested.

C-SEO Bench is the benchmark introduced by Puerto, Gubri, Green, Oh, and Yun in their 2025 paper of the same name, accepted at the NeurIPS 2025 Datasets & Benchmarks Track¹. It is the first benchmark designed specifically to evaluate Conversational Search Engine Optimization (C-SEO) methods across multiple tasks, multiple domains, and varying numbers of competing actors adopting each method.

The paper states its headline finding directly: "most current C-SEO methods are largely ineffective, contrary to reported results in the literature." Across 9 tested C-SEO methods (7 drawn from Aggarwal et al. 2023's GEO methods plus 2 novel methods, evaluated alongside a traditional retrieval-ranking SEO baseline for comparison), 6 domains, and 1,921 queries, only 3 of 54 method-domain cells reached statistical significance under Bonferroni-Holm correction (p<0.05). The traditional retrieval-ranking baseline (moving the source to position 1 in the LLM's context window) measured approximately 7.6× more effective than the best C-SEO method tested on the retail product-recommendation domain: a mean rank improvement of 2.77±2.31 for the SEO baseline versus 0.36±1.47 for LLM Guidance, the strongest C-SEO method tested.
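To make the correction step concrete, here is a minimal sketch of the standard Holm step-down procedure, which is what "Bonferroni-Holm correction" usually refers to. The function name `holm_bonferroni` and the p-values are illustrative assumptions; they are not taken from the paper or its code.

```python
from typing import List


def holm_bonferroni(p_values: List[float], alpha: float = 0.05) -> List[bool]:
    """Flag which hypotheses are rejected under the Holm step-down procedure."""
    m = len(p_values)
    # Visit hypotheses from the smallest p-value to the largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    rejected = [False] * m
    for step, idx in enumerate(order):
        # The (step + 1)-th smallest p-value is compared against alpha / (m - step).
        if p_values[idx] <= alpha / (m - step):
            rejected[idx] = True
        else:
            # The first failure stops the procedure; all larger p-values are kept.
            break
    return rejected


# Made-up p-values for illustration only; they are not taken from the paper.
example = [0.001, 0.04, 0.03, 0.012]
print(holm_bonferroni(example))  # [True, False, False, True]

# With 54 cells, the smallest p-value has to clear 0.05 / 54.
print(round(0.05 / 54, 6))  # 0.000926
```

In the made-up example, three of the four p-values fall below an uncorrected 0.05, but only two survive the correction. With 54 cells the first threshold is far stricter, which is part of why so few cells can reach significance.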

Status in 2026

C-SEO Bench is a peer-reviewed counter-evidence anchor for most existing GEO and Conversational-SEO claims. It was accepted at the NeurIPS 2025 Datasets & Benchmarks Track after arXiv submission on 2025-06-06 (revised through v3 on 2025-10-20). It explicitly tests Aggarwal et al. 2023's GEO methods under a protocol that is more production-realistic than single-actor synthetic GEO tests, and reaches qualitatively opposite conclusions for most methods.

The differences are methodologically load-bearing, not a re-run of identical conditions. Aggarwal reports Position-Adjusted Word Count (PAWC, the share of the LLM's response text drawn from a target source) under single-actor synthetic conditions, on a GPT-3.5-turbo testbed (2023). C-SEO Bench reports citation ranking (which source the LLM cites at the top of its source list) under continuous multi-actor adoption rates, evaluated on more recent LLMs (gpt-4o-mini-2024-07-18 and claude-3-5-haiku-20241022).

The C-SEO Bench paper's Discussion section is explicit about the relationship to Aggarwal: "While the results from Aggarwal et al. (2024) show some initial optimism about C-SEO methods, we do not observe the same effectiveness. The differences are due to the metrics used. They report their main results as word count ... this metric does not measure the LLM preference, contrary to the citation ranking that we use." The same Discussion section adds a stronger reconciliation point: "their results with this metric show a general decrease in the scores, implicitly indicating that the C-SEO methods do not generally improve citation ranking." In other words, Aggarwal's own PAWC data, when inspected for citation-ranking implications, also tends to support C-SEO Bench's null conclusion. (The C-SEO Bench paper cites Aggarwal as "2024" because the GEO paper was first preprint-submitted November 2023 and formally published in 2024; both labels refer to the same arXiv:2311.09735.)
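The gap between a word-count metric and a citation-ranking metric can be illustrated with a small sketch. This is a simplification under stated assumptions: `word_share` computes a plain word-count share and omits the position weighting that PAWC applies, and `rank_improvement` treats citation ranking as a position in an ordered list of cited sources. Both function names and all data below are hypothetical, not the papers' implementations.

```python
from typing import Dict, List, Tuple


def word_share(response: List[Tuple[str, str]]) -> Dict[str, float]:
    """Share of response words attributed to each source.

    `response` is a list of (sentence, source_id) pairs. This is a plain
    word-count share; it omits the position weighting used by PAWC.
    """
    total = sum(len(sentence.split()) for sentence, _ in response)
    if total == 0:
        raise ValueError("response contains no words")
    shares: Dict[str, float] = {}
    for sentence, source in response:
        shares[source] = shares.get(source, 0.0) + len(sentence.split()) / total
    return shares


def rank_improvement(before: List[str], after: List[str], source: str) -> int:
    """Positions gained in the cited-source list (positive means moved up)."""
    return before.index(source) - after.index(source)


# Hypothetical response and citation lists, for illustration only.
response = [
    ("Source A describes battery life and noise cancelling in detail.", "A"),
    ("Source B is cheaper.", "B"),
]
shares = word_share(response)
print(round(shares["A"], 2))  # 0.71
print(rank_improvement(["B", "A", "C"], ["B", "A", "C"], "A"))  # 0
```

In this toy case source A accounts for most of the response text yet gains no positions in the cited list, which is the kind of divergence the Discussion quote points to when it says word count "does not measure the LLM preference".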

Practitioner adoption of the C-SEO Bench finding through May 2026 is still light. Most SEO content marketing references Aggarwal 2023's "30-40% relative improvement" framing without acknowledging the C-SEO Bench counterpoint. The asymmetry creates an editorial opportunity: a reference that cites both papers provides a more accurate picture than one that cites only the optimistic side.

How to apply

The practical recalibration for content-writing programs working in the AI-search era:

Prioritize traditional retrieval-ranking SEO over C-SEO content interventions. Moving the source document up in the LLM's context window (the retrieval-ranking SEO objective) produced ~7.6× more measured impact on citation ranking than the best content-level C-SEO method tested in this benchmark. Index discipline, link equity, server response time, canonical hygiene, and topical authority remain the dominant levers; treat content-side interventions as secondary multipliers.
Treat any single-actor C-SEO claim as untested at scale. The "+33% PAWC" or "+43% PAWC" headline numbers from Aggarwal 2023 reflect single-actor synthetic conditions; under C-SEO Bench's multi-actor evaluation, the same methods do not transfer to citation-ranking improvements. Apply the C-SEO Bench discount to vendor or blog claims that cite a single Aggarwal PAWC number.
Expect zero-sum dynamics as C-SEO adoption spreads. Multi-actor results show that gains decrease as adoption rates rise: any method that captures lift today erodes toward zero as competitors adopt the same technique. Plan content strategy around durable signals (source authority, content quality, retrieval-friendliness) rather than chasing single-tactic lift.
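The source note for this entry says the benchmark varies the adoption rate α across [0, 1] and reports the area under the curve per method-domain cell. A minimal sketch of that kind of summary follows, assuming a trapezoidal rule over sampled adoption rates. The integration method, the function name `adoption_auc`, and the gain values are assumptions for illustration, not the paper's implementation or data.

```python
from typing import Sequence


def adoption_auc(alphas: Sequence[float], gains: Sequence[float]) -> float:
    """Trapezoidal area under a gain-versus-adoption-rate curve."""
    if len(alphas) != len(gains) or len(alphas) < 2:
        raise ValueError("need at least two matching (alpha, gain) points")
    area = 0.0
    for i in range(1, len(alphas)):
        width = alphas[i] - alphas[i - 1]
        if width <= 0:
            raise ValueError("alphas must be strictly increasing")
        # Average the two neighbouring gains and weight by the step width.
        area += width * (gains[i] + gains[i - 1]) / 2
    return area


# Hypothetical gains that shrink as more competitors adopt the same method.
alphas = [0.0, 0.25, 0.5, 0.75, 1.0]
gains = [0.4, 0.3, 0.2, 0.1, 0.0]
print(round(adoption_auc(alphas, gains), 3))  # 0.2
```

A single number like this averages the early-adopter lift together with the eroded lift at high adoption, so a method that only helps when nobody else uses it scores low overall.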

What to skip:

"Adopt Aggarwal's top-3 GEO methods for guaranteed citation lift" promises. The top-3 did not transfer to citation ranking under multi-actor conditions tested by C-SEO Bench; the lift was specific to PAWC in single-actor synthetic testing.
Automated "C-SEO content optimizer" tooling that does not specify which methods it applies or which evaluation conditions it has been validated against. If a tool appears to rely on the same surface-level Aggarwal-derived methods C-SEO Bench tested, ask what validation conditions support its claimed lift.
Citing only Aggarwal 2023's PAWC numbers in client-facing collateral. The honest practice is to cite both Aggarwal and C-SEO Bench and let the client see the methodological asymmetry.

How it relates to other concepts

Counterpoint to Generative Engine Optimization as the field was originally framed. Aggarwal et al. 2023 introduced "GEO" alongside 9 content-modification methods measured against PAWC; C-SEO Bench is the multi-actor citation-ranking counter-test that the field's optimism has not fully absorbed yet.
Directly tests five methods that have their own entries here, as 5 of its 7 Aggarwal-derived methods: Quotation Addition (relabelled "Quotes" in the C-SEO Bench paper), Statistics Addition (statistical-density; relabelled "Statistics"), Cite Sources Optimization ("Citations"), Fluency Optimization ("Fluency"), and Authoritative Statement Strength ("Authoritative"). Each of those entries cites C-SEO Bench as the counter-evidence to its own Aggarwal-2023 lift number.
Reinforces the broader null-result territory around mechanical C-SEO tactics including Keyword Stuffing, though C-SEO Bench does not directly retest Aggarwal's Keyword Stuffing method. Keyword Stuffing was Aggarwal's flagship negative result; C-SEO Bench extends the negative territory to methods Aggarwal had reported as positive. Both papers point in the same direction: classical SEO tactic transfer to LLM citation is unreliable.
Distinct from web-search-side benchmarks like BEIR or MS MARCO, which measure document-ranking quality independent of LLM responses. C-SEO Bench measures the downstream citation-ranking outcome in a CSE pipeline (Conversational Search Engine) where the LLM has access to retrieved candidate documents and chooses which to cite.
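The retrieval-ranking baseline described in this entry (moving the source document to position 1 in the LLM's context window) can be sketched as a simple reordering of the retrieved candidates before they are passed to the model. The function name `promote_to_first` and the document identifiers are hypothetical; the sketch only shows the reordering idea, not the paper's pipeline.

```python
from typing import List


def promote_to_first(context_docs: List[str], target: str) -> List[str]:
    """Return a new ordering with `target` at context position 1."""
    if target not in context_docs:
        raise ValueError(f"{target!r} is not among the retrieved documents")
    # Keep the relative order of every other document unchanged.
    return [target] + [doc for doc in context_docs if doc != target]


# Hypothetical retrieved candidates, in the order the retriever returned them.
retrieved = ["doc_a", "doc_b", "doc_c"]
print(promote_to_first(retrieved, "doc_c"))  # ['doc_c', 'doc_a', 'doc_b']
```

The content of the promoted document is untouched; only its position changes, which is what separates this baseline from the content-level C-SEO methods.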

¹ Puerto, H., Gubri, M., Green, T., Oh, S.J., and Yun, S. "C-SEO Bench: Does Conversational SEO Work?" arXiv:2506.11097, first submitted 2025-06-06; revised through v3 (2025-10-20). Accepted at NeurIPS 2025 Datasets & Benchmarks Track. Affiliations: Parameter Lab + UKP Lab at TU Darmstadt (Puerto, Gubri); University of Mannheim (Green); University of Tübingen + Tübingen AI Center (Oh); NAVER AI Lab (Yun). Tests 9 C-SEO methods using the paper's own labels: 7 derived from Aggarwal et al. 2023 (Authoritative, Statistics, Citations, Fluency, Unique Words, Simple Language, Quotes) plus 2 novel methods (Content Improvement, LLM Guidance). The Aggarwal-derived names are renamed by C-SEO Bench: Statistics = Aggarwal's "Statistics Addition"; Citations = "Cite Sources"; Simple Language = "Easy-to-Understand"; Quotes = "Quotation Addition". A separate Best SEO baseline (moving the source document to position 1 in the LLM's context window) is also evaluated for comparison, bringing total items evaluated to ten. Tested across 6 domains in two task families. Question-answering: Web (300 queries), News (294), Debate (142). Product-recommendation: Retail (500), Video Games (436), Books (249). Total: 1,921 queries. Multi-actor adoption rate α varies continuously across [0, 1]; the area under the curve is reported per method-domain cell. Out of 54 method-domain cells (9 C-SEO methods × 6 domains; the Best SEO baseline is reported separately), only 3 reached statistical significance under Bonferroni-Holm correction at p<0.05. The strongest single C-SEO method on retail (LLM Guidance) achieved a rank improvement of 0.36±1.47; the Best SEO baseline achieved 2.77±2.31 on the same domain, ~7.6× more effective. Evaluated on gpt-4o-mini-2024-07-18 and claude-3-5-haiku-20241022, vs Aggarwal 2023's GPT-3.5-turbo testbed. Paper's Discussion verbatim on Aggarwal: "While the results from Aggarwal et al. (2024) show some initial optimism about C-SEO methods, we do not observe the same effectiveness. The differences are due to the metrics used. They report their main results as word count ... this metric does not measure the LLM preference, contrary to the citation ranking that we use." The Discussion section also notes that Aggarwal's own data, when re-examined for citation-ranking implications, points in the same direction as C-SEO Bench's null finding: "their results with this metric show a general decrease in the scores, implicitly indicating that the C-SEO methods do not generally improve citation ranking." C-SEO Bench cites Aggarwal as "(2024)" because the GEO paper was first preprint-released November 2023 and formally published 2024; both labels refer to the same arXiv:2311.09735. C-SEO Bench does not test Aggarwal's Keyword Stuffing method or discuss black-hat techniques. License: CC BY 4.0.

FAQ

What is C-SEO Bench?

C-SEO Bench is a benchmark introduced by Puerto, Gubri, Green, Oh, and Yun in their 2025 paper (arXiv:2506.11097, accepted at NeurIPS 2025 Datasets & Benchmarks Track). It evaluates 9 Conversational Search Engine Optimization (C-SEO) methods (7 drawn from Aggarwal et al. 2023's GEO methods plus 2 novel methods: Content Improvement and LLM Guidance) across 6 domains (Web, News, Debate for question-answering; Retail, Video Games, Books for product-recommendation) and 1,921 queries total. The benchmark also separately evaluates a traditional retrieval-ranking SEO baseline (moving the source document to context position 1) as comparison, bringing total items evaluated to ten. The C-SEO Bench paper uses its own naming for the Aggarwal-derived methods (Simple Language, Citations, Quotes, Statistics) which differs from Aggarwal's original labels (Easy-to-Understand, Cite Sources, Quotation Addition, Statistics Addition). The distinguishing feature is its multi-actor evaluation protocol that varies the adoption rate of each C-SEO method continuously from 0% to 100% of source documents.

How does C-SEO Bench differ from Aggarwal 2023's original GEO benchmark?

Two methodological differences are load-bearing. First, the outcome metric: Aggarwal 2023 reports Position-Adjusted Word Count (PAWC), the fraction of the LLM's response text drawn from a given source; C-SEO Bench reports citation ranking, which source the LLM cites at the top of its source list. Second, Aggarwal's benchmark tests one source adopting a method at a time (single-actor); C-SEO Bench tests competitive multi-actor adoption, where N candidate sources adopt the same method simultaneously. The paper attributes its contrasting findings to these two choices.

Did C-SEO Bench find any GEO methods that actually work?

Mostly no, under the benchmark's specific test conditions. Out of 54 method-domain cells tested (9 methods × 6 domains), only 3 reached statistical significance under Bonferroni-Holm correction at p<0.05. The strongest single C-SEO method tested, LLM Guidance, produced a citation-ranking gain of 0.36±1.47 on the retail domain, roughly 7.6× smaller than the traditional retrieval-ranking SEO baseline (moving the source to position 1 in the LLM's context window) which achieved 2.77±2.31 on the same domain. The paper's recommendation is that publishers invest in retrieval-ranking optimization first; C-SEO methods are secondary and individually marginal. Statistical insignificance here means the tested method did not reliably improve citation ranking under this benchmark's correction procedure on these specific tasks, domains, and LLMs; it does not prove a method is useless in every engine, topic, or implementation.

What does the 'zero-sum' finding mean?

C-SEO Bench's multi-actor protocol measures what happens as more candidate sources adopt the same C-SEO method. Verbatim from the paper: 'as we increase the number of C-SEO adopters, the overall gains decrease, depicting a congested and zero-sum nature of the problem.' Practically: if you adopt a C-SEO method when no one else has, you may capture some lift; once your competitors also adopt it, the lift erodes toward zero. The paper frames this as a game-theory result and notes it creates the same kind of content-arms-race dynamic as classical SEO. The implication: any single C-SEO method is unlikely to provide durable competitive advantage at scale.

Should I stop optimizing for AI search based on C-SEO Bench?

No, but recalibrate. The paper's prescription is to prioritize traditional retrieval-ranking SEO (measured ~7.6× more effective than any tested C-SEO method on retail) and treat content-level interventions like adding quotations or statistics as secondary multipliers. C-SEO Bench does not cover all GEO claims; glossary-coined practitioner discipline (cite-ability, definition-lead style, schema-as-machine-readability) is outside its scope. Treat it as a strong null on its tested methods, not on AI-search content discipline broadly.

Last fact-checked 2026-05-31. Spotted an error or stale claim? See editorial methodology.

Changelog (2 entries)

2026-05-31: Initial publish: C-SEO Bench is the Puerto et al. 2025 NeurIPS Datasets & Benchmarks paper introducing the first benchmark to evaluate Conversational SEO methods across multiple domains, tasks, and competing-actor counts. Headline finding: 'most current C-SEO methods are largely ineffective,' and a traditional retrieval-ranking SEO baseline was measured 7.6× more effective than the best C-SEO method tested on the retail domain. Serves as the counter-evidence anchor for the geo-content-methods cluster, complementing the Aggarwal 2023 entries with multi-actor findings that approximate production more closely than single-actor synthetic tests.
2026-05-31: Same-day revisions. Method names switched to the paper's own terminology (Simple Language / Citations / Quotes / Statistics rather than Aggarwal's Easy-to-Understand / Cite Sources / Quotation Addition / Statistics Addition); Aggarwal labels kept in parentheses. Added a second Discussion quote where the paper notes Aggarwal's own PAWC data, on inspection, also points to decreasing scores. Added specific evaluation LLMs (gpt-4o-mini + claude-3-5-haiku) as concrete anchor for the more-production-realistic framing. Clarified the paper benchmarks nine C-SEO methods plus a separate baseline (ten items total).