External traffic disambiguation
Citation status
Last checked 2026-05-27
External traffic disambiguation is a practitioner-coined methodology for classifying likely real external visitors separately from the site owner's own browsing, headless-browser scrapers, self-identifying crawlers, and VPN edge routing artifacts. The framework uses five orthogonal axes read off server logs (Vercel Logs in this site's case, but the methodology generalizes to any server-log source with comparable fields). Each visit is vetted against all five axes before being promoted to "confirmed external"; partial axis pass is downgraded to "candidate".
Status in 2026
External traffic measurement on a small indie publisher site faces a structural gap. Browser-execution analytics tools (Google Analytics, Vercel Analytics) exclude bots and AI training crawlers by design (no JavaScript execution = no event recorded), but they also bundle all real-browser visitors into a single category without surfacing the per-request signals needed to separate founder browsing from genuine external traffic. Server logs (Vercel Logs, Cloudflare logs, raw Nginx / Apache access logs) capture every request including bots, and expose the per-request edge node, cache state, full UA, and referer fields, but require manual analysis or custom tooling to derive any signal beyond raw request counts.
The 5-axis disambiguation framework fills the analytical gap between "all real browsers" (analytics view) and "real external visitors specifically" (the practitioner-relevant measurement target). It is a server-log-based methodology applied per visit, not a continuous metric.
The 5 axes
Each visit is vetted against all 5 axes. Confirmed external = all 5 pass; candidate = 3-4 pass; rejected = ≤2 pass. These thresholds are operational labels for this site, not a statistical classifier or universal standard; another publisher applying the same axes might choose different cutoffs to match a different false-positive tolerance.
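The thresholds above reduce to a small counting rule. A minimal sketch, assuming each axis has already been judged pass / fail by hand; the axis key names are hypothetical labels chosen for illustration, not fields from any log format:

```python
from typing import Mapping

# Hypothetical short names for the five axes, in the order described on this page.
AXES = (
    "foreign_edge",
    "first_time_cache",
    "off_routine_path",
    "browser_ua_and_referer",
    "non_scraper_ua",
)


def label_visit(axis_pass: Mapping[str, bool]) -> str:
    """Map five binary axis results to this site's operational label."""
    missing = [axis for axis in AXES if axis not in axis_pass]
    if missing:
        # A visit must be vetted against all five axes before it gets a label.
        raise ValueError(f"unvetted axes: {missing}")
    passed = sum(bool(axis_pass[axis]) for axis in AXES)
    if passed == 5:
        return "confirmed external"
    if passed >= 3:
        return "candidate"
    return "rejected"
```

The function returns a label rather than a score on purpose: the per-axis results stay in the input mapping, so the reason for a downgrade is still visible next to the label.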
Axis 1: Foreign geo edge. The Vercel edge node that served the request is not founder-reachable without VPN. Founder is in Australia and routes through Sydney (syd1); London (lhr1), Frankfurt (fra1), Paris (cdg1), Singapore (sin1), Hong Kong (hkg1), Washington DC (iad1), Tokyo (hnd1), and other foreign edges are not in the founder's routine. Important caveat: the edge reflects the VPN exit node if any; the framework only claims "not founder", not "user is physically in country X". Conflating these two was the original framework's main error.
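As a sketch of how narrow the Axis 1 claim is, the check below only answers "was this edge outside the founder's routine?". The set of founder edges is taken from the description above; the function name and the string format of the edge code are assumptions:

```python
# Edges the founder reaches without a VPN (syd1, per the description above).
FOUNDER_EDGES = frozenset({"syd1"})


def passes_axis_1(edge: str, founder_edges: frozenset = FOUNDER_EDGES) -> bool:
    """True when the serving edge is not one the founder routinely hits.

    This says "not founder" only. It makes no claim about the visitor's
    physical country, because the edge reflects any VPN exit node.
    """
    return edge.strip().lower() not in founder_edges
```

For example, `passes_axis_1("lhr1")` passes and `passes_axis_1("syd1")` fails; neither result says anything about where the visitor physically is.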
Axis 2: First-time cache state. Either cache MISS / PRERENDER 0s (first-time fetch, meaning this visit triggered the cache build), or cache HIT against a cache generated by another foreign-geo visit (proven by cache-age math: visit-time minus HIT-age gives the cache build time, which must match a prior foreign-geo visit's timestamp, not a founder visit). This axis catches founder-cache contamination, where a founder visit warmed the edge cache and a subsequent HIT looks "foreign" but is actually serving founder-warmed content.
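The cache-age check can be sketched as follows. The build time of a cached response is the visit time minus the reported age, and that build time is then matched against earlier visits. The tuple layout for prior visits, the 5-second tolerance, and the rule that an unmatched HIT fails are all assumptions made for this illustration:

```python
from datetime import datetime, timedelta, timezone

FOUNDER_EDGES = frozenset({"syd1"})


def cache_built_at(visit_time: datetime, hit_age_seconds: float) -> datetime:
    """Cache build time = time of this visit minus the age of the cached copy."""
    return visit_time - timedelta(seconds=hit_age_seconds)


def find_warming_visit(built_at, prior_visits, tolerance_seconds=5.0):
    """Return the (timestamp, edge) of the prior visit closest to the build time.

    prior_visits is an iterable of (timestamp, edge) tuples. Returns None when
    no prior visit falls within the tolerance window.
    """
    closest = None
    for ts, edge in prior_visits:
        gap = abs((ts - built_at).total_seconds())
        if gap <= tolerance_seconds and (closest is None or gap < closest[0]):
            closest = (gap, ts, edge)
    return None if closest is None else (closest[1], closest[2])


def passes_axis_2(cache_state, visit_time, hit_age_seconds, prior_visits):
    """MISS (or PRERENDER at age 0) passes; HIT passes only if foreign-warmed."""
    if cache_state == "MISS" or (cache_state == "PRERENDER" and hit_age_seconds == 0):
        return True
    if cache_state != "HIT":
        return False
    warm = find_warming_visit(cache_built_at(visit_time, hit_age_seconds), prior_visits)
    # Assumption: a HIT whose warming visit cannot be identified fails the axis.
    return warm is not None and warm[1] not in FOUNDER_EDGES
```

Failing an unmatched HIT is the conservative choice: it trades some false negatives for fewer founder-warmed visits slipping through.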
Axis 3: Path is not in founder-verify routine. Founder regularly visits the homepage, /observatory, recent term pages under active write, and recently-published evidence files. A visit to a long-tail term page (e.g. /terms/citation-share) at a timestamp when founder is provably elsewhere is much more likely to be external. This axis is weak alone (founder occasionally checks any page), but combined with the other axes it filters out the predictable founder paths.
Axis 4: Full browser UA + external referrer. Real-user indicator strength, descending: Google referer > Perplexity / Bing / Yandex / Brave referer > forum / social referer > defensive-domain referer (e.g. geoglossary.dev 301 redirect) > no referer + non-datacenter UA > no referer + datacenter region. This ordering is a site-specific heuristic; another publisher should recalibrate against their own logs and spam profile. The "no referer + datacenter region" combination is the weakest signal and typically indicates a scraper from AWS / Hetzner / Azure / DigitalOcean. Real browser UAs include full Chrome / Safari / Firefox version strings consistent with the device class; bot UAs are typically truncated, self-identifying (PerplexityBot/1.0), or use outdated Chrome versions (more than 10 versions behind current; Chrome ships a new major roughly every four weeks, so the reference major must be looked up at review time and this threshold needs periodic recalibration).
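A rough sketch of the two mechanical parts of Axis 4: ranking the referer and flagging a stale Chrome major. The domain lists are illustrative assumptions (for instance, only `google.com` and its subdomains are matched, not country-specific Google domains), any unlisted referer is assumed to fall in the forum / social tier, and the current Chrome major is supplied by the caller rather than hard-coded:

```python
import re
from typing import Optional
from urllib.parse import urlsplit

# Tier 1 is the strongest real-user signal, tier 6 the weakest.
# Domain lists are illustrative and need per-site calibration.
TIER_DOMAINS = [
    (1, ("google.com",)),
    (2, ("perplexity.ai", "bing.com", "yandex.com", "yandex.ru", "brave.com")),
    (4, ("geoglossary.dev",)),  # defensive domain named on this page
]

CHROME_RE = re.compile(r"Chrome/(\d+)\.")


def referer_tier(referer: Optional[str], datacenter_region: bool) -> int:
    """Rank a request by referer strength (1 = strongest, 6 = weakest)."""
    if not referer:
        return 6 if datacenter_region else 5
    host = (urlsplit(referer).hostname or "").lower()
    for tier, domains in TIER_DOMAINS:
        if any(host == d or host.endswith("." + d) for d in domains):
            return tier
    return 3  # assumption: any other referer is treated as forum / social


def chrome_major(ua: str) -> Optional[int]:
    """Extract the Chrome major version from a UA string, if present."""
    match = CHROME_RE.search(ua)
    return int(match.group(1)) if match else None


def is_stale_chrome(ua: str, current_major: int, max_lag: int = 10) -> bool:
    """True when the UA's Chrome major is more than max_lag behind current."""
    major = chrome_major(ua)
    return major is not None and current_major - major > max_lag
```

Passing `current_major` in at call time keeps the staleness threshold from silently ageing along with the code.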
Axis 5: Non-scraper UA pattern. Mobile UAs are often less common in basic scraper traffic than generic desktop Chrome UAs, but they are still spoofable. Outdated Chrome version strings raise scraper suspicion (see Axis 4 maintenance note). Real human signals across multiple visits include browser-version diversity, mixed device classes (desktop and mobile from the same edge over time), and mixed referer sources. A single bare google.com referer from a current Chrome on a foreign edge that returns once and never again is consistent with either a real user or a sophisticated scraper; multiple visits with diverse UAs from the same edge tilt the probability toward real users.
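The multi-visit signals in Axis 5 (browser-version diversity, mixed device classes, mixed referer sources) can be tallied per edge before a human reads them. A minimal sketch, assuming each visit is a dict with hypothetical `edge`, `ua`, `device_class`, and `referer_host` keys; it reports counts only and deliberately applies no pass / fail threshold:

```python
from collections import defaultdict


def diversity_by_edge(visits):
    """Count distinct UAs, device classes, and referer hosts seen per edge."""
    seen = defaultdict(
        lambda: {"visits": 0, "uas": set(), "devices": set(), "referers": set()}
    )
    for visit in visits:
        group = seen[visit["edge"]]
        group["visits"] += 1
        group["uas"].add(visit["ua"])
        group["devices"].add(visit["device_class"])
        # Visits without a referer are grouped under one placeholder value.
        group["referers"].add(visit.get("referer_host") or "(none)")
    return {
        edge: {
            "visits": group["visits"],
            "distinct_uas": len(group["uas"]),
            "distinct_devices": len(group["devices"]),
            "distinct_referers": len(group["referers"]),
        }
        for edge, group in seen.items()
    }
```

A single visit from an edge yields a count of 1 everywhere, which matches the point above: one visit alone cannot separate a real user from a sophisticated scraper.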
The framework deliberately does not produce a continuous score. Each axis is binary (pass / fail) and the framework treats "confirmed external" as a high-evidence label, not a probability claim. Visits that fail axes 2, 3, or 5 are downgraded to candidate or rejected; the asymmetric weighting reflects how much the false-positive cost matters for any downstream measurement claims based on the data.
How to apply
The methodology runs as a per-visit manual review against the 5 axes, not as a continuous pipeline. Workflow:
- Pre-condition: have access to raw server logs with per-request fields (Vercel Logs, Cloudflare logs, or equivalent) showing at minimum the edge node, cache state, path, full UA, referer, and timestamp. Browser-analytics-only views (Google Analytics, Vercel Analytics) do not have the required fields.
- Review trigger: surface candidate visits via Vercel Logs filters (foreign edge != syd1, real-browser UA, not in known scraper IP ranges). For an indie site this is a manual nightly or weekly task, not a real-time alert system.
- Apply all 5 axes: any visit that fails axis 2, 3, or 5 is downgraded to candidate; visits with mixed-signal axis 4 (no referer + datacenter region) are downgraded; the 5-axis pass threshold is the only "confirmed external" gate.
- Log to evidence files: confirmed-external visits, with axis-pass notes, go into a dated evidence file in `research/citations/` (or equivalent). The evidence file becomes the source of truth for any downstream measurement claim (citation share, external visit count by country, channel attribution).
- Maintain a contamination vector list: founder self-clicks from `chatgpt.com`, `claude.ai`, or any other AI consumer surface count as founder, not external, because founder probes those surfaces routinely. Maintain a known-contamination list and exclude matching visits.
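The review-trigger and contamination steps of the workflow above can be sketched as one pre-filter that narrows the rows a reviewer then vets by hand. Everything here is an assumption for illustration: the row keys (`edge`, `ip`, `referer`) are hypothetical, the scraper network is a documentation-only placeholder range, and the real-browser UA check is left to the manual review:

```python
import ipaddress
from urllib.parse import urlsplit

FOUNDER_EDGES = {"syd1"}
# Known-contamination list: AI consumer surfaces the founder probes routinely.
CONTAMINATION_HOSTS = {"chatgpt.com", "claude.ai"}
# Placeholder only (TEST-NET-3); replace with ranges observed in your own logs.
SCRAPER_NETWORKS = [ipaddress.ip_network("203.0.113.0/24")]


def referer_host(referer):
    """Lower-cased hostname of a referer URL, or '' when absent or unparseable."""
    return (urlsplit(referer).hostname or "").lower() if referer else ""


def is_contaminated(referer):
    """True when the referer matches the known-contamination list."""
    host = referer_host(referer)
    return any(host == h or host.endswith("." + h) for h in CONTAMINATION_HOSTS)


def in_scraper_range(ip):
    """True when the client IP falls inside a known scraper network."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in SCRAPER_NETWORKS)


def surface_candidates(rows):
    """Keep rows worth a manual 5-axis review; this step confirms nothing."""
    return [
        row
        for row in rows
        if row["edge"] not in FOUNDER_EDGES
        and not in_scraper_range(row["ip"])
        and not is_contaminated(row.get("referer"))
    ]
```

The output is a review queue, not a verdict: every surviving row still has to pass all 5 axes before it is written to an evidence file.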
What to skip:
- Treating analytics tool counts (Vercel Analytics, Google Analytics) as the external-visitor count. They bundle founder and real external visits together and miss the bot category entirely; the resulting number is structurally not what the 5-axis framework measures.
- Building a single continuous "confidence score" across the 5 axes. The framework's asymmetric weighting (axis 2 / 3 / 5 failures downgrade) is informative; collapsing it loses the per-axis diagnosis of why a visit was downgraded.
- Applying the framework on a single visit to claim engine-level citation share. A single confirmed external visit is one data point. Citation-share claims need 5+ confirmed visits per engine before they sit on stable ground.
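The minimum-evidence rule in the last item can be enforced mechanically before any citation-share figure is reported. A minimal sketch, assuming each confirmed visit carries a hypothetical `engine` key assigned during channel attribution:

```python
from collections import Counter

# Threshold stated above: 5+ confirmed visits per engine.
MIN_CONFIRMED = 5


def engines_with_enough_evidence(confirmed_visits, min_confirmed=MIN_CONFIRMED):
    """Return {engine: count} only for engines meeting the minimum count."""
    counts = Counter(visit["engine"] for visit in confirmed_visits)
    return {engine: n for engine, n in counts.items() if n >= min_confirmed}
```

Engines missing from the result are not "zero share"; they simply lack enough confirmed visits to support a claim either way.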
What remains contested or unverified
- Whether the 5-axis framework generalizes to non-Vercel server-log sources without methodology adjustment. Cloudflare, Fastly, and raw Apache / Nginx logs have similar per-request fields but the edge-routing and cache-state semantics may differ.
- Whether the "scraper UA pattern" axis (axis 5) can distinguish modern headless browsers configured with realistic UA strings and randomized delays from real users.
How it relates to other concepts
- Upstream tool for attribution rate, citation share, and cite-ability measurement. Traffic-derived reach metrics that include unverified visits are inflated by bot noise; the disambiguation methodology produces the clean denominator needed for those metrics to mean what their definitions claim.
- Crawl-side complement to AI crawler bots: the crawler entry documents the user-agent strings AI engines use when fetching content; the disambiguation methodology documents how to read those UAs in context against the other 4 axes to classify a request as scraper versus real visitor.
- Referenced as the verification gate for evidence files supporting IndexNow Protocol (e.g. the Yandex 2-click confirmation that triggered the IndexNow Yandex inclusion analysis used the 5-axis framework for verification).
- Cross-cuts with AI Overview citation, AI Mode, ChatGPT search citation, and AI dev tool citations for channel-attribution work: the disambiguation framework narrows visits to "real external", but channel attribution (which AI surface drove this visit) requires referer pattern + GSC cross-check + per-surface detection methodology. Several citation-surface entries reference this framework as the upstream gate for their own citation-detection claims.
Footnotes
- Methodology origin and per-axis revision history are documented in internal evidence files under the site repository's `research/citations/` directory (not publicly browsable). The original baseline file (`2026-05-20_external-traffic-confirmed.md`) records the initial Day 8 batch of 3 confirmed visits and 1 candidate; its revision history captures the Day 9 axis-5 addition and several subsequent false-positive corrections, including the cursor.com web-vs-IDE-citation case study. Adjacent evidence files document the first organic GSC click and a systematic crawl pattern observation, all using the same 5-axis framework for vetting.
Related terms
- [IndexNow Protocol](/terms/indexnow-protocol)
- [AI crawler bots](/terms/ai-crawler-bots)
- [Attribution rate](/terms/attribution-rate)
- [Citation share](/terms/citation-share)
- [Cite-ability](/terms/cite-ability)
- [AI Overview citation](/terms/ai-overview-citation)
- [AI Mode](/terms/ai-mode)
- [AI dev tool citations](/terms/ai-dev-tool-citations)
- [ChatGPT search citation](/terms/chatgpt-search-citation)
FAQ
- Why not just use Vercel Analytics or Google Analytics?
- Both work for total visitor counts, but neither separates the categories that matter for AI-citation reach measurement. Vercel Analytics requires the visitor's browser to execute JavaScript, which excludes bots and AI training crawlers (a feature, not a bug) but also bundles all real-browser visitors together without surfacing referer / edge-region / cache-state context. Google Analytics has the same JavaScript-execution requirement plus a bare `google.com` referer that does not distinguish AI Overview / AI Mode / blue-link clicks. Server logs (Vercel Logs in this site's case) capture every request including bots and contain the per-request edge node, cache state, full UA, and referer fields that the 5-axis framework reads. The disambiguation methodology fills the gap between 'all real browsers' and 'real external visitors specifically'.
- Is the 5-axis framework standardized somewhere?
- No. The framework is a practitioner-coined shorthand developed in-house at GEO Glossary across Days 8-13 of the site's operation (May 20 to May 25, 2026), based on observed disambiguation problems with the first batch of confirmed external visits. It is not in any vendor documentation, academic paper, or standards body. Other publisher sites have developed similar ad-hoc methodologies; the 5-axis version codified here is one specific articulation, not the only valid one.
- What gets misclassified when you apply only 1 or 2 of the 5 axes?
- Two failure modes documented in the Days 8-15 evidence work: (1) treating foreign-edge routing alone as proof of 'real visitor in country X' misses VPN exit node routing, which is why the framework was revised on Day 9 to claim only 'not founder' rather than 'user in country X'; (2) treating a referer like `cursor.com` as proof of 'Cursor IDE AI citation' misses that desktop dev tools open citation links via the system browser without referer (the cursor.com referer was actually web traffic from cursor.com's documentation pages, not from inside the IDE). Applying axes 1 and 4 alone produces these false positives. Applying all 5 axes catches them by requiring axes 2 + 3 + 5 to also corroborate.
- Can this methodology measure AI citation traffic specifically?
- It can identify whether traffic is real-external versus founder / scraper / training crawler, and combined with referer information it can narrow channel attribution (Google referer narrows to Google ecosystem; perplexity.ai referer narrows to Perplexity web app). It cannot specifically attribute traffic to AI Overview citation versus AI Mode citation versus blue-link click within a Google referer, because Google strips identifying parameters and sends a bare `google.com` referer for all Search-surface clicks. Cross-referencing with Google Search Console URL-level data and AI Overview Search Appearance rows narrows further (see [AI Overview citation](/terms/ai-overview-citation) and [AI Mode](/terms/ai-mode) for the documented limits).
Sources & further reading
- Vercel: Runtime logs documentation (data fields available for log-based analysis)
- Vercel: Edge network regions (foreign-edge routing reference)
- Cloudflare: Log fields reference (parallel server-log analysis source)
- Nginx: HTTP log module documentation (raw access log field reference)
- MDN: Referer header documentation (referrer-policy semantics affecting Axis 4)
- Cloudflare Radar: traffic and bot composition background (general scraper context for Axis 5)