
External traffic disambiguation

External traffic disambiguation is a practitioner-coined methodology for distinguishing real external visitors to a website from the site owner's own browsing, headless-browser scrapers, AI training crawlers, and VPN edge artifacts. The framework uses five orthogonal axes (foreign edge / cache state / path pattern / UA-plus-referer / non-scraper UA pattern) read off server logs (such as Vercel Logs) and applied jointly. It is a publisher-side method for cases where traditional analytics tools cannot reliably separate AI-citation-driven traffic from bot noise.

Citation status

ChatGPT: · | Perplexity: · | Claude: 0× | Copilot: 0× | Gemini: ·

Last checked 2026-06-29

External traffic disambiguation is a practitioner-coined methodology for classifying likely real external visitors separately from the site owner's own browsing, headless-browser scrapers, self-identifying crawlers, and VPN edge routing artifacts. The framework uses five orthogonal axes read off server logs (Vercel Logs in this site's case, but the methodology generalizes to any server-log source with comparable fields). Each visit is vetted against all five axes before being promoted to "confirmed external"; a visit that passes only some axes is downgraded to "candidate" or rejected.

Status in 2026

External traffic measurement on a small indie publisher site faces a structural gap. Browser-execution analytics tools (Google Analytics, Vercel Analytics) exclude bots and AI training crawlers by design (no JavaScript execution = no event recorded), but they also bundle all real-browser visitors into a single category without surfacing the per-request signals needed to separate founder browsing from genuine external traffic. Server logs (Vercel Logs, Cloudflare logs, raw Nginx / Apache access logs) capture every request including bots, and expose the per-request edge node, cache state, full UA, and referer fields, but require manual analysis or custom tooling to derive any signal beyond raw request counts.

The 5-axis disambiguation framework fills the analytical gap between "all real browsers" (analytics view) and "real external visitors specifically" (the practitioner-relevant measurement target). It is a server-log-based methodology applied per visit, not a continuous metric.

The framework took shape through a sequence of revisions documented across internal evidence files:

Day 8 (2026-05-20): initial framework with 4 axes (foreign edge / cache state / path / UA + referer). Surfaced 3 confirmed external visits and 1 candidate.
Day 9 (2026-05-21): 5th axis (non-scraper UA pattern) added after the author noted that foreign-edge routing alone could come from VPN exit nodes ("could these clicks be from VPN?"). The framework was revised to claim only "not founder", not "user in country X", because VPN exit nodes produce the same edge routing as the claimed country.
Days 11-13 (2026-05-23 to 2026-05-25): 14+ confirmed visits accumulated across 7 countries (UK, FR, SG, DE, US, CA, RU). Two false-positive disambiguation events caught and used to refine the framework: the cursor.com referer originally read as a Cursor IDE AI citation, corrected to web traffic from cursor.com docs; and a foreign-edge DE visit downgraded from confirmed to candidate due to a VPN / datacenter scraper hypothesis that could not be ruled out.

The 5 axes

Each visit is vetted against all 5 axes. Confirmed external = all 5 pass; candidate = 3-4 pass; rejected = ≤2 pass. These thresholds are operational labels for this site, not a statistical classifier or universal standard; another publisher applying the same axes might choose different cutoffs to match a different false-positive tolerance.

Axis 1: Foreign geo edge. The Vercel edge node that served the request is not founder-reachable without VPN. Founder is in Australia and routes through Sydney (syd1); London (lhr1), Frankfurt (fra1), Paris (cdg1), Singapore (sin1), Hong Kong (hkg1), Washington DC (iad1), Tokyo (hnd1), and other foreign edges are not in the founder's routine. Important caveat: the edge reflects the VPN exit node if any; the framework only claims "not founder", not "user is physically in country X". Conflating these two was the original framework's main error.

Axis 2: First-time cache state. Either cache MISS / PRERENDER 0s (first-time fetch, meaning this visit triggered the cache build), or cache HIT against a cache generated by another foreign-geo visit (proven by cache-age math: visit time minus HIT age = the time the cache was generated, which must match a prior foreign-geo visit's timestamp, not a founder visit). This axis catches founder-cache contamination where a founder visit warmed the edge cache and a subsequent HIT looks "foreign" but is actually serving founder-warmed content.
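The cache-age check is plain timestamp arithmetic: the cached copy was generated at the request time minus its reported age. The sketch below is one possible way to write it, not the site's tooling. The function names, the 5-second tolerance, and the sample timestamps are all assumptions made for illustration.

```python
from datetime import datetime, timedelta, timezone

def cache_built_at(hit_time: datetime, age_seconds: int) -> datetime:
    """Time the cached copy was generated: request time minus cache age."""
    return hit_time - timedelta(seconds=age_seconds)

def built_by_foreign_visit(hit_time, age_seconds, foreign_visits, founder_visits,
                           tolerance_seconds=5):
    """True if the cache build lines up with a prior foreign-geo visit
    and not with a founder visit. The tolerance is an assumed value."""
    built = cache_built_at(hit_time, age_seconds)
    tolerance = timedelta(seconds=tolerance_seconds)

    def near(ts):
        return abs(ts - built) <= tolerance

    if any(near(ts) for ts in founder_visits):
        return False  # founder warmed this cache: axis 2 fails
    return any(near(ts) for ts in foreign_visits)

# Made-up example: a HIT at 10:30:00 UTC with age 1800 s implies a 10:00:00 build.
hit = datetime(2026, 5, 23, 10, 30, 0, tzinfo=timezone.utc)
print(cache_built_at(hit, 1800).isoformat())
```

If no logged visit sits near the computed build time, the function returns False, which matches the framework's conservative default of downgrading rather than promoting.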

Axis 3: Path is not in founder-verify routine. Founder regularly visits the homepage, /observatory, recent term pages under active write, and recently-published evidence files. A visit to a long-tail term page (e.g. /terms/citation-share) at a timestamp when founder is provably elsewhere is much more likely to be external. This axis is weak alone (founder occasionally checks any page), but combined with the other axes it filters out the predictable founder paths.

Axis 4: Full browser UA + external referrer. Real-user indicator strength, descending: Google referer > Perplexity / Bing / Yandex / Brave referer > forum / social referer > defensive-domain referer (e.g. geoglossary.dev 301 redirect) > no referer + non-datacenter UA > no referer + datacenter region. This ordering is a site-specific heuristic; another publisher should recalibrate against their own logs and spam profile. The "no referer + datacenter region" combination is the weakest signal and typically indicates a scraper from AWS / Hetzner / Azure / DigitalOcean. Real browser UAs include full Chrome / Safari / Firefox version strings consistent with the device class; bot UAs are typically truncated, self-identifying (PerplexityBot/1.0), or use outdated Chrome versions (more than 10 versions behind current; the current Chrome major as of mid-2026 is roughly Chrome 134, so this threshold needs periodic recalibration).
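The referer ordering and the outdated-Chrome rule can be sketched as two small helpers. This is a hedged illustration only. The hostname lists are assumptions (the forum / social hosts in particular are placeholders), and Google country domains are deliberately not handled. The current Chrome major is passed in by the caller rather than hard-coded, because it changes over time.

```python
import re
from typing import Optional
from urllib.parse import urlparse

# Tier 1 is the strongest real-user signal, tier 6 the weakest.
# Hostnames are illustrative assumptions, not the site's actual list.
TIERED_HOSTS = [
    (1, ("google.com",)),
    (2, ("perplexity.ai", "bing.com", "yandex.com", "yandex.ru", "search.brave.com")),
    (3, ("reddit.com", "news.ycombinator.com")),  # placeholder forum / social hosts
    (4, ("geoglossary.dev",)),                     # defensive-domain redirect
]

def referer_tier(referer: Optional[str], datacenter_region: bool) -> Optional[int]:
    """Map a Referer header to a tier; None means 'unrecognized, review by hand'."""
    if not referer:
        return 6 if datacenter_region else 5
    host = (urlparse(referer).hostname or "").lower()
    for tier, domains in TIERED_HOSTS:
        if any(host == d or host.endswith("." + d) for d in domains):
            return tier
    return None

CHROME_MAJOR = re.compile(r"Chrome/(\d+)\.")

def chrome_is_stale(ua: str, current_major: int, max_lag: int = 10) -> bool:
    """True if the UA claims a Chrome major more than max_lag versions behind."""
    match = CHROME_MAJOR.search(ua)
    if match is None:
        return False  # not a Chrome UA; this check does not apply
    return current_major - int(match.group(1)) > max_lag
```

Returning None for an unrecognized referer keeps the decision with the reviewer instead of silently assigning a tier the ordering does not define.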

Axis 5: Non-scraper UA pattern. Mobile UAs are often less common in basic scraper traffic than generic desktop Chrome UAs, but they are still spoofable. Outdated Chrome version strings raise scraper suspicion (see Axis 4 maintenance note). Real human signals across multiple visits include browser-version diversity, mixed device classes (desktop and mobile from the same edge over time), and mixed referer sources. A single bare google.com referer from a current Chrome on a foreign edge that returns once and never again is consistent with either a real user or a sophisticated scraper; multiple visits with diverse UAs from the same edge tilt the probability toward real users.
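The multi-visit diversity signals can be tallied per edge before a reviewer judges them. The sketch below assumes each visit has already been parsed into a dict. The keys `edge`, `ua`, `device`, and `referer_host` are hypothetical field names, not a Vercel Logs schema.

```python
from collections import defaultdict

def edge_diversity(visits):
    """Summarize UA, device-class, and referer diversity per edge node."""
    seen = defaultdict(lambda: {"visits": 0, "uas": set(), "devices": set(), "referers": set()})
    for v in visits:
        bucket = seen[v["edge"]]
        bucket["visits"] += 1
        bucket["uas"].add(v["ua"])
        bucket["devices"].add(v["device"])
        bucket["referers"].add(v["referer_host"])
    return {
        edge: {
            "visits": b["visits"],
            "distinct_uas": len(b["uas"]),
            "device_classes": len(b["devices"]),
            "referer_sources": len(b["referers"]),
        }
        for edge, b in seen.items()
    }
```

The summary is an input to judgment rather than a verdict: an edge with one visit and one UA stays ambiguous, as the paragraph above notes.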

The framework deliberately does not produce a continuous score. Each axis is binary (pass / fail) and the framework treats "confirmed external" as a high-evidence label, not a probability claim. Visits that fail axes 2, 3, or 5 are downgraded to candidate or rejected; the asymmetric weighting reflects how much the false-positive cost matters for any downstream measurement claims based on the data.
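The labelling rule (all 5 pass = confirmed external, 3-4 = candidate, 2 or fewer = rejected) can be written as a small function over five booleans. This is a minimal sketch using this site's operational cutoffs. The class and field names are invented here for illustration.

```python
from dataclasses import dataclass, fields

@dataclass(frozen=True)
class AxisResult:
    foreign_edge: bool            # axis 1
    first_time_cache: bool        # axis 2
    non_routine_path: bool        # axis 3
    browser_ua_and_referer: bool  # axis 4
    non_scraper_ua: bool          # axis 5

def label(result: AxisResult) -> str:
    """Apply the 5 / 3-4 / <=2 cutoffs to one vetted visit."""
    passed = sum(getattr(result, f.name) for f in fields(result))
    if passed == 5:
        return "confirmed external"
    if passed >= 3:
        return "candidate"
    return "rejected"

def failed_axes(result: AxisResult) -> list:
    """Keep the per-axis diagnosis instead of collapsing it into a score."""
    return [f.name for f in fields(result) if not getattr(result, f.name)]
```

Recording `failed_axes` alongside the label preserves the reason a visit was downgraded, which a single number would lose.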

How to apply

The methodology runs as a per-visit manual review against the 5 axes, not as a continuous pipeline. Workflow:

Pre-condition: have access to raw server logs with per-request fields (Vercel Logs, Cloudflare logs, or equivalent) showing at minimum the edge node, cache state, path, full UA, referer, and timestamp. Browser-analytics-only views (Google Analytics, Vercel Analytics) do not have the required fields.
Review trigger: surface candidate visits via Vercel Logs filters (foreign edge != syd1, real-browser UA, not in known scraper IP ranges). For an indie site this is a manual nightly or weekly task, not a real-time alert system.
Apply all 5 axes: any visit that fails axis 2, 3, or 5 is downgraded to candidate; visits with mixed-signal axis 4 (no referer + datacenter region) are downgraded; the 5-axis pass threshold is the only "confirmed external" gate.
Log to evidence files: confirmed-external visits, with axis-pass notes, go into a dated evidence file in research/citations/ (or equivalent). The evidence file becomes the source of truth for any downstream measurement claim (citation share, external visit count by country, channel attribution).
Maintain a contamination vector list: founder self-clicks from chatgpt.com, claude.ai, or any other AI consumer surface count as founder, not external, because founder probes those surfaces routinely. Maintain a known-contamination list and exclude matching visits.
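The review trigger and the contamination list from the workflow above can be sketched as two filters over parsed log rows. Everything here is an assumption made for illustration: the row keys (`edge`, `ua`, `ip`, `referer`, `time`), the bot-token heuristic, the scraper network (a reserved documentation range used as a placeholder), and the modelling of a contamination entry as a referer host plus a time window around a founder probe.

```python
import ipaddress
from datetime import datetime, timezone
from urllib.parse import urlparse

FOUNDER_EDGE = "syd1"
SCRAPER_NETWORKS = [ipaddress.ip_network("203.0.113.0/24")]  # placeholder range
BOT_TOKENS = ("bot", "crawler", "spider")  # rough heuristic, can false-match

def needs_review(row: dict) -> bool:
    """Surface a row for manual 5-axis review."""
    if row["edge"] == FOUNDER_EDGE:
        return False
    ua = row["ua"].lower()
    if not ua.startswith("mozilla/5.0") or any(t in ua for t in BOT_TOKENS):
        return False
    ip = ipaddress.ip_address(row["ip"])
    return not any(ip in net for net in SCRAPER_NETWORKS)

def is_known_contamination(row: dict, probes: list) -> bool:
    """True if the row matches a logged founder probe of an AI surface."""
    host = (urlparse(row.get("referer") or "").hostname or "").lower()
    return any(
        host == p["referer_host"] and p["start"] <= row["time"] <= p["end"]
        for p in probes
    )
```

Rows that pass `needs_review` and are not flagged by `is_known_contamination` are only candidates for the manual axis check; neither filter promotes a visit on its own.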

What to skip:

Treating analytics tool counts (Vercel Analytics, Google Analytics) as the external-visitor count. They bundle founder and real external visits together and miss the bot category entirely; the resulting number is structurally not what the 5-axis framework measures.
Building a single continuous "confidence score" across the 5 axes. The framework's asymmetric weighting (axis 2 / 3 / 5 failures downgrade) is informative; collapsing it loses the per-axis diagnosis of why a visit was downgraded.
Applying the framework on a single visit to claim engine-level citation share. A single confirmed external visit is one data point. Citation-share claims need 5+ confirmed visits per engine before they sit on stable ground.
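The "5+ confirmed visits per engine" guard can be expressed as a simple count over the evidence records. This is a sketch under assumptions: each confirmed visit is a dict carrying a hypothetical `engine` key assigned during channel attribution.

```python
from collections import Counter

MIN_CONFIRMED_PER_ENGINE = 5  # the cutoff named in the list above

def engines_ready_for_share_claims(confirmed_visits, minimum=MIN_CONFIRMED_PER_ENGINE):
    """Return only the engines with enough confirmed visits to support a claim."""
    counts = Counter(v["engine"] for v in confirmed_visits)
    return {engine: n for engine, n in counts.items() if n >= minimum}
```

Engines below the cutoff are simply absent from the result, so a report built on it cannot state a share for them by accident.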

What remains contested or unverified

Whether the 5-axis framework generalizes to non-Vercel server-log sources without methodology adjustment. Cloudflare, Fastly, and raw Apache / Nginx logs have similar per-request fields but the edge-routing and cache-state semantics may differ.
Whether the "scraper UA pattern" axis (axis 5) can distinguish modern headless browsers configured with realistic UA strings and randomized delays from real users.

Whether the framework's binary axis-pass model handles partial-confidence cases well, or whether a probability-weighted alternative would produce different "confirmed external" promotions. No comparative study has been run.
Whether other indie publisher sites have developed convergent methodologies (suggesting the 5 axes capture stable practitioner intuitions) or divergent ones (suggesting framework-specific bias). Practitioner discussion on this is sparse.

How it relates to other concepts

Upstream tool for attribution rate, citation share, and cite-ability measurement. Traffic-derived reach metrics that include unverified visits are inflated by bot noise; the disambiguation methodology produces the clean denominator needed for those metrics to mean what their definitions claim.
Crawl-side complement to AI crawler bots: the crawler entry documents the user-agent strings AI engines use when fetching content; the disambiguation methodology documents how to read those UAs in context against the other 4 axes to classify a request as scraper versus real visitor.
Referenced as the verification gate for evidence files supporting IndexNow Protocol (e.g. the Yandex 2-click confirmation that triggered the IndexNow Yandex inclusion analysis used the 5-axis framework for verification).
Cross-cuts with AI Overview citation, AI Mode, ChatGPT search citation, and AI dev tool citations for channel-attribution work: the disambiguation framework narrows visits to "real external", but channel attribution (which AI surface drove this visit) requires referer pattern + GSC cross-check + per-surface detection methodology. Several citation-surface entries reference this framework as the upstream gate for their own citation-detection claims.

Methodology origin and per-axis revision history are documented in internal evidence files under the site repository's research/citations/ directory (not publicly browsable). The original baseline file (2026-05-20_external-traffic-confirmed.md) records the initial Day 8 batch of 3 confirmed visits and 1 candidate; its revision history captures the Day 9 axis-5 addition and several subsequent false-positive corrections including the cursor.com web-vs-IDE-citation case study. Adjacent evidence files document the first organic GSC click and a systematic crawl pattern observation, all using the same 5-axis framework for vetting.

FAQ

Why not just use Vercel Analytics or Google Analytics?

Both work for total visitor counts, but neither separates the categories that matter for AI-citation reach measurement. Vercel Analytics requires the visitor's browser to execute JavaScript, which excludes bots and AI training crawlers (a feature, not a bug) but also bundles all real-browser visitors together without surfacing referer / edge-region / cache-state context. Google Analytics has the same JavaScript-execution requirement plus a bare `google.com` referer that does not distinguish AI Overview / AI Mode / blue-link clicks. Server logs (Vercel Logs in this site's case) capture every request including bots and contain the per-request edge node, cache state, full UA, and referer fields that the 5-axis framework reads. The disambiguation methodology fills the gap between 'all real browsers' and 'real external visitors specifically'.

Is the 5-axis framework standardized somewhere?

No. The framework is a practitioner-coined shorthand developed in-house at GEO Glossary across Days 8-13 of the site's operation (May 20 to May 25, 2026), based on observed disambiguation problems with the first batch of confirmed external visits. It is not in any vendor documentation, academic paper, or standards body. Other publisher sites have developed similar ad-hoc methodologies; the 5-axis version codified here is one specific articulation, not the only valid one.

What gets misclassified when you apply only 1 or 2 of the 5 axes?

Two failure modes documented in the Days 8-15 evidence work: (1) treating foreign-edge routing alone as proof of 'real visitor in country X' misses VPN exit node routing, which is why the framework was revised on Day 9 to claim only 'not founder' rather than 'user in country X'; (2) treating a referer like `cursor.com` as proof of 'Cursor IDE AI citation' misses that desktop dev tools open citation links via the system browser without referer (the cursor.com referer was actually web traffic from cursor.com's documentation pages, not from inside the IDE). Applying axes 1 and 4 alone produces these false positives. Applying all 5 axes catches them by requiring axes 2 + 3 + 5 to also corroborate.

Can this methodology measure AI citation traffic specifically?

It can identify whether traffic is real-external versus founder / scraper / training crawler, and combined with referer information it can narrow channel attribution (Google referer narrows to Google ecosystem; perplexity.ai referer narrows to Perplexity web app). It cannot specifically attribute traffic to AI Overview citation versus AI Mode citation versus blue-link click within a Google referer, because Google strips identifying parameters and sends a bare `google.com` referer for all Search-surface clicks. Cross-referencing with Google Search Console URL-level data and AI Overview Search Appearance rows narrows further (see [AI Overview citation](/terms/ai-overview-citation) and [AI Mode](/terms/ai-mode) for the documented limits).

Last fact-checked 2026-05-27. Spotted an error or stale claim? See editorial methodology.

Changelog (3 entries)

2026-05-27: ChatGPT citation confirmed. A fresh ChatGPT search probe on 'What is external traffic disambiguation in AI search analytics?' returned this entry as the top source, with the description paraphrased inline and attributed to GEO Glossary. citationStatus.chatgpt: untested -> cited; other engines not yet probed. Day-15 sweep also surfaced definition-lead-style as a same-day ChatGPT citation. Top-sourced 2 days after publish; third confirmed ChatGPT citation on the site and fastest publish-to-cited interval to date. Consistent with the pattern that practitioner-coined terms in low-competition territory attract citations.
2026-05-27: Same-day revision. Body and prior changelog entries rewritten to remove project-internal voice: version-codename references replaced with their observable meaning, process vocabulary replaced with the underlying actions, internal repository path references removed. Five hedging refinements: AI training crawlers narrowed to known/self-identifying crawlers; threshold rule labeled operational not statistical; Axis 5 mobile UA hedged; Axis 4 referer hierarchy marked site-specific; citation-rate-denominator tightened to traffic-derived reach metrics. Sources expanded from 2 to 6 entries.
2026-05-27: Initial publish. External traffic disambiguation is the practitioner-coined methodology for separating real external visitors from founder browsing, scrapers, AI training crawlers, and VPN edge artifacts in server logs. The 5-axis framework (foreign edge / cache state / path / UA-plus-referer / non-scraper UA pattern) was developed in-house across Days 8-13 of site operation (14+ confirmed visits across 7 countries). Entry codifies what was previously scattered across internal probe logs and editorial notes; referenced from indexnow-protocol and several evidence files. Glossary-coined practitioner shorthand; no vendor canonical exists.