AI crawler bots
Citation status
Last checked 2026-06-22
What is an AI crawler bot?
An AI crawler bot is an automated fetcher operated by an AI search engine and identified by a user-agent string plus an IP signature. Each major AI engine runs at least one crawler bot, often multiple bots for different purposes: training corpus collection, real-time retrieval, and user-initiated browsing. Site owners control bot access through directives in robots.txt [1][2][3][4]. Important caveat for any production use: verify bot traffic by published IP range or reverse DNS, not by user-agent string alone, because UA strings are trivially spoofable. Cloudflare's August 2025 investigation documented an AI vendor using an undeclared Chrome-impersonation UA to reach test domains that had explicitly disallowed the vendor's declared crawler [5].
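The reverse-DNS half of that caveat can be sketched as a forward-confirmed reverse DNS check: resolve the IP to a hostname, confirm the hostname belongs to the vendor's domain, then resolve the hostname back and confirm it returns the same IP. This is a minimal sketch, not a vendor-endorsed implementation. The function name, the injectable `reverse` / `forward` resolvers, and the suffix list passed in are all assumptions for illustration; take the actual hostname suffixes from each vendor's own documentation.

```python
import ipaddress
import socket


def forward_confirmed_rdns(ip, allowed_suffixes, reverse=None, forward=None):
    """Return True if `ip` reverse-resolves to an allowed hostname
    that in turn resolves back to the same `ip`.

    `reverse` and `forward` are hypothetical injection points so the check
    can be exercised without network access; by default they use the
    standard-library resolver.
    """
    reverse = reverse or (lambda addr: socket.gethostbyaddr(addr)[0])
    forward = forward or (
        lambda host: [info[4][0] for info in socket.getaddrinfo(host, None)]
    )
    try:
        target = ipaddress.ip_address(ip)
        host = reverse(ip).rstrip(".").lower()
    except (OSError, ValueError):
        return False  # malformed IP or no PTR record

    # Require an exact domain match or a true subdomain, so that
    # "notgooglebot.com" does not pass for "googlebot.com".
    if not any(host == s or host.endswith("." + s) for s in allowed_suffixes):
        return False

    try:
        return any(ipaddress.ip_address(a) == target for a in forward(host))
    except (OSError, ValueError, KeyError):
        return False  # forward lookup failed or returned garbage
```

A call such as `forward_confirmed_rdns(client_ip, ("googlebot.com", "google.com"))` would then treat a request as genuine only when both lookups agree; the suffixes shown are an example to check against the vendor's published guidance.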
Allow major AI crawlers (robots.txt snippet)
# Real-time retrieval (drives AI-citation visibility)
User-agent: OAI-SearchBot
Allow: /
User-agent: PerplexityBot
Allow: /
User-agent: Claude-SearchBot
Allow: /
# User-initiated browsing; robots.txt compliance varies per vendor (see comments)
User-agent: ChatGPT-User
Allow: /
# OpenAI states "robots.txt rules may not apply" because requests are user-initiated.
User-agent: Perplexity-User
Allow: /
# Perplexity states this fetcher "generally ignores robots.txt rules."
# Cloudflare (Aug 2025) also documented Perplexity using an undeclared
# Chrome-impersonation UA against blocked test domains; robots.txt is not
# a hard enforcement boundary for Perplexity traffic at the network level.
User-agent: Claude-User
Allow: /
# Training-corpus collection (separate licensing / IP / competitive decision)
User-agent: GPTBot
Allow: /
User-agent: ClaudeBot
Allow: /
User-agent: Meta-ExternalAgent
Allow: /
# Meta states this crawler "may bypass robots.txt rules"; robots.txt blocking
# is best-effort, not enforceable.
User-agent: Meta-ExternalFetcher
Allow: /
# Same caveat as Meta-ExternalAgent: bypasses robots.txt at vendor discretion.
# Control tokens (not crawlers; signal opt-out from model training)
User-agent: Google-Extended
Allow: /
# Google states blocking has no effect on Google Search rankings or AI Overview
# (AI Overviews use Googlebot, not Google-Extended). Blocking removes content
# from training Gemini and Vertex AI generative APIs only.
User-agent: Applebot-Extended
Allow: /
# Apple states Applebot-Extended does not crawl; it is a signal that controls
# whether content fetched by Applebot can be used for foundation model training.
# Research / general non-Google-Search crawling
User-agent: GoogleOther
Allow: /
Retrieval / user-initiated bots are how your pages enter AI engines' citation candidate pools. Blocking one of them can remove you from that engine's citation surface for the relevant product. OpenAI says so most explicitly: sites blocking OAI-SearchBot "will not be shown in ChatGPT search answers, though can still appear as navigational links." Cross-engine generalizations beyond that are not vendor-documented; each engine should be evaluated separately.
Training-corpus bots (GPTBot, ClaudeBot, Meta-ExternalAgent) and control tokens (Google-Extended, Applebot-Extended) govern whether content can be used for foundation-model training. Whether to allow them is a licensing, IP, competitive, and compliance decision independent of search visibility. OpenAI and Google both explicitly document GPTBot/Google-Extended as independent from their search-retrieval toggles.
GoogleOther is Google's general-purpose non-Search crawler (research, product evaluation, internal use). The decision to allow or disallow it depends on the publisher's general bot-access policy.
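To see how a compliant crawler would read independent retrieval and training rules, the following minimal sketch parses a small robots.txt with Python's standard-library `urllib.robotparser`. The sample file, the `allowed` helper, and the `example.com` URL are assumptions for illustration. The sample disallows GPTBot purely to demonstrate the toggle, unlike the allow-all snippet above, and each vendor's own parser may match user agents differently from Python's.

```python
from urllib.robotparser import RobotFileParser

# Illustrative policy: allow OpenAI's retrieval crawler, block its training crawler.
SAMPLE_ROBOTS_TXT = """\
User-agent: OAI-SearchBot
Allow: /

User-agent: GPTBot
Disallow: /
"""


def allowed(robots_txt, user_agent, url):
    """Return True if `user_agent` may fetch `url` under `robots_txt`,
    according to Python's robots.txt parser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


results = {
    bot: allowed(SAMPLE_ROBOTS_TXT, bot, "https://example.com/guide")
    for bot in ("OAI-SearchBot", "GPTBot", "ClaudeBot")
}
# OAI-SearchBot is allowed, GPTBot is blocked, and ClaudeBot (no matching
# group and no "*" group in the sample) falls through to allowed.
```

This only models what a rule-following crawler would conclude; it says nothing about fetchers that ignore robots.txt.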
Status in 2026: known AI crawlers
| Operator | Training | Retrieval | User-initiated |
|---|---|---|---|
| OpenAI | GPTBot | OAI-SearchBot (+ OAI-AdsBot for ad-safety) | ChatGPT-User (UA-only; "robots.txt may not apply" per OpenAI) |
| Perplexity | (none documented; PerplexityBot is retrieval only per Perplexity docs) | PerplexityBot | Perplexity-User ("generally ignores robots.txt" per Perplexity; see also Cloudflare Aug 2025 report) |
| Anthropic | ClaudeBot | Claude-SearchBot | Claude-User |
| Google | Google-Extended (control token, not a crawler; see below) | Googlebot (shared with classic Search and AI Overview) | n/a |
| Microsoft | Bingbot (shared with classic) | Bingbot | n/a |
| Apple | Applebot-Extended (control token, not a crawler) | Applebot (general/Search) | n/a |
| Meta | Meta-ExternalAgent (per Meta, may bypass robots.txt) | (not separately surfaced) | Meta-ExternalFetcher (per Meta, may bypass robots.txt) |
| Brave | (not separately documented) | Brave's own crawler (Brave Search index) | n/a |
| DuckDuckGo | (none documented) | DuckAssistBot (Search Assist; respects standard robots.txt per DuckDuckGo) | n/a |
| xAI (Grok) | (documented strings exist but are not observed in actual server logs) | Documented as GrokBot, xAI-Grok, Grok-DeepSearch (no first-party crawler documentation page; observed retrieval traffic spoofs browser user agents like Chrome and Safari and originates from rotating datacenter / proxy IPs per Stackfox and DataDome research) | n/a |
| ByteDance | Bytespider (commonly reported to ignore robots.txt; not formally documented) | (not separately disclosed) | n/a |
Google-Extended and Applebot-Extended are control tokens rather than crawlers; they do not fetch web content themselves. They sit in robots.txt so publishers can opt out of having their content used for foundation-model training (Gemini/Vertex AI for Google-Extended; Apple's foundation models for Applebot-Extended). Google explicitly documents that blocking Google-Extended has no effect on Google Search inclusion or rankings, and that AI Overviews source from Googlebot data rather than Google-Extended. Apple explicitly documents that Applebot-Extended does not crawl webpages.
xAI (Grok) is the outlier on crawler controllability. Independent research (Stackfox, DataDome) has documented that observed Grok retrieval traffic does not surface the documented GrokBot, xAI-Grok, or Grok-DeepSearch user agents; instead it arrives from rotating proxy / datacenter IPs (e.g., M247 AS9009, Datacamp AS212238) with spoofed Chrome, Safari, and iPhone user agents. xAI does not publish a first-party crawler documentation page comparable to OpenAI's platform.openai.com/docs/bots or Anthropic's ClaudeBot doc, and there is no published xAI commitment to robots.txt compliance. The practical implication is that user-agent-based robots.txt rules against GrokBot are unlikely to be honored; if your editorial position is to exclude Grok specifically, enforcement must happen at the WAF or network layer on observed retrieval patterns. See Grok citation for the citation-side framing.
How to apply
You control AI crawler access through your robots.txt. Two operational moves and two diagnostics:
- Ship explicit allow rules for the major retrieval bots: don't rely on `User-Agent: *` as a default. Explicit allow signals intent and shows up in vendor dashboards. The minimum retrieval allow-list is OAI-SearchBot, ChatGPT-User, PerplexityBot, Perplexity-User, Claude-User, Claude-SearchBot. Add Google-Extended and Applebot-Extended as separate training-data control tokens (these don't affect retrieval, only model training). See the code block above for the full pattern with per-bot caveats.
- Decide your training-data stance per-vendor, not once: GPTBot (OpenAI training), ClaudeBot (Anthropic training), Meta-ExternalAgent (Meta training), Google-Extended (Gemini and Vertex AI training), and Applebot-Extended (Apple foundation-model training) are independent toggles, each governing a different vendor's training pipeline. The licensing, IP, competitive, and compliance considerations differ per vendor; treat the decision as five separate calls, not one. The legacy `anthropic-ai` UA is no longer documented by Anthropic and has been removed from the recommended allow-list above. `cohere-ai` is sometimes listed in third-party bot directories but is not officially documented by Cohere; this glossary does not recommend a Cohere-specific UA until it is.
- Verify by IP / reverse DNS, not user-agent alone: UA strings are trivially spoofable. OpenAI, Perplexity, and Google publish IP ranges (Perplexity ships JSON endpoints at `perplexity.com/perplexitybot.json` and `perplexity.com/perplexity-user.json`; OpenAI lists IP ranges in its bots doc; Google publishes ranges via `googlebot.json`). Match observed bot traffic against these published ranges before treating a UA claim as genuine. The Cloudflare August 2025 Perplexity-stealth-crawler investigation specifically warned that a respected vendor was sending traffic from undeclared UAs against blocked sites; UA-only matching misses this.
- Audit your server logs monthly: a simple `grep -i "gptbot\|perplexitybot\|claudebot\|meta-externalagent" access.log | cut -d' ' -f7 | sort | uniq -c` shows how many times the matched bots fetched each path (field 7 is the request path in the common / combined log format; the count is aggregated across bots, not split per bot). Pages with zero AI bot visits are unlikely to surface in citation candidates; those are your weak spots. Cross-check the source IPs of those visits against the vendors' published ranges to catch spoofed traffic.
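The log audit in the last bullet can also be broken down per bot, which the grep pipeline does not do. The sketch below assumes the combined access-log format; the `BOT_TOKENS` tuple, the `LOG_LINE` pattern, and the `count_bot_hits` name are illustrative choices, and the sample lines in the usage note are invented for demonstration. Matching is on the claimed user agent only, so it inherits the spoofing caveat from the verification bullet.

```python
import re
from collections import Counter

# Lowercase substrings to look for in the claimed user agent (illustrative list).
BOT_TOKENS = (
    "gptbot", "oai-searchbot", "chatgpt-user",
    "perplexitybot", "perplexity-user",
    "claudebot", "claude-searchbot", "claude-user",
    "meta-externalagent",
)

# Combined log format: ip ident user [time] "METHOD path proto" status size "referer" "ua"
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]*\] '
    r'"\S+ (?P<path>\S+)[^"]*" \d{3} \S+ "[^"]*" "(?P<ua>[^"]*)"'
)


def count_bot_hits(lines):
    """Count fetches per (bot token, request path) from access-log lines."""
    hits = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match is None:
            continue  # skip lines that are not in the assumed format
        ua = match.group("ua").lower()
        for token in BOT_TOKENS:
            if token in ua:
                hits[(token, match.group("path"))] += 1
                break
    return hits
```

For example, `count_bot_hits(open("access.log"))` returns a `Counter` keyed by `(bot, path)`; paths on the site that never appear under any retrieval bot are the candidates to investigate.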
What to skip: treating robots.txt as an enforceable boundary for vendors that explicitly state they may bypass it (Perplexity-User, Meta-ExternalAgent, Meta-ExternalFetcher, ChatGPT-User in user-initiated mode) or for actors willing to send traffic from undeclared UAs. If hard blocking matters (legal, paywall, security), enforce at the WAF / network layer (Cloudflare bot management, Fastly, custom IP allowlists) rather than relying on robots.txt alone. Also skip paid bot-management tooling in month 1: free Cloudflare tier plus log analysis is enough to baseline.
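The IP-range verification that underpins both the diagnostics and any custom allowlist can be sketched with Python's `ipaddress` module. The JSON shape below (a `prefixes` list with `ipv4Prefix` / `ipv6Prefix` keys) is an assumption modeled on Google's published range file; confirm the actual schema of each vendor's endpoint before relying on it. The sample ranges are reserved documentation networks, not real vendor ranges, and the function names are hypothetical.

```python
import ipaddress
import json

# Placeholder data using documentation-only networks (NOT real vendor ranges).
SAMPLE_RANGES_JSON = """
{
  "prefixes": [
    {"ipv4Prefix": "192.0.2.0/24"},
    {"ipv4Prefix": "198.51.100.0/24"},
    {"ipv6Prefix": "2001:db8::/32"}
  ]
}
"""


def load_networks(ranges_json):
    """Parse a published-ranges document into a list of network objects."""
    data = json.loads(ranges_json)
    networks = []
    for entry in data.get("prefixes", []):
        cidr = entry.get("ipv4Prefix") or entry.get("ipv6Prefix")
        if cidr:
            networks.append(ipaddress.ip_network(cidr))
    return networks


def ip_in_published_ranges(ip, networks):
    """Return True if `ip` falls inside any of the published networks."""
    try:
        addr = ipaddress.ip_address(ip)
    except ValueError:
        return False  # malformed address in the log line
    return any(addr.version == net.version and addr in net for net in networks)


networks = load_networks(SAMPLE_RANGES_JSON)
```

A request whose user agent claims to be a vendor's bot but whose source IP fails `ip_in_published_ranges` is a candidate for spoofed traffic and for a network-layer rule.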
How it relates to other concepts
- Technical prerequisite for Generative Engine Optimization: AI engines cannot cite content they cannot fetch. Crawler access is necessary but not sufficient; GEO also depends on content structure, entity recognition, authority signals, retrieval/citation selection logic, and third-party mentions.
- Implementation detail of robots.txt configuration for AI-aware sites.
- Companion to LLMS.txt: robots.txt controls bot access, LLMS.txt curates content for the bots you allow.
- Distinguishing classic-search vs. AI-bot traffic in server logs is the foundation of measuring AI-search visibility before any vendor analytics tool is added.
Footnotes
1. OpenAI: "Overview of OpenAI Crawlers" (canonical post-2025 URL). Documents GPTBot, OAI-SearchBot, OAI-AdsBot, and ChatGPT-User with IP ranges, purpose, and robots.txt semantics (notably: GPTBot/OAI-SearchBot are independent toggles, and "robots.txt rules may not apply" to user-initiated ChatGPT-User fetches). Sites opted out of OAI-SearchBot "will not be shown in ChatGPT search answers, though can still appear as navigational links." developers.openai.com/api/docs/bots.
2. Perplexity: "Bots." Documents PerplexityBot (retrieval; "not used to crawl content for AI foundation models") and Perplexity-User (user-initiated; "generally ignores robots.txt rules"). Published IP endpoints: `perplexity.com/perplexitybot.json` and `perplexity.com/perplexity-user.json`. docs.perplexity.ai/guides/bots.
3. Anthropic: "Does Anthropic crawl data from the web?" Documents ClaudeBot (training), Claude-SearchBot (search-result quality; blocking "may reduce your site's visibility and accuracy in user search results"), and Claude-User (user-initiated). The legacy `anthropic-ai` UA is not listed in current Anthropic documentation. support.claude.com/en/articles/8896518-does-anthropic-crawl-data-from-the-web.
4. Meta: "AI Web Crawlers." Documents Meta-ExternalAgent (training and product indexing) and Meta-ExternalFetcher (user-initiated agentic actions). Meta states both crawlers "may bypass robots.txt rules." developers.facebook.com/docs/sharing/webmasters/web-crawlers.
5. Gabriel Corral, Vaibhav Singhal, Brian Mitchell, Reid Tatoris. "Perplexity is using stealth, undeclared crawlers to evade website no-crawl directives." Cloudflare blog, 2025-08-04. Cloudflare's investigation used newly purchased domains that had never been indexed and shipped robots.txt prohibiting all automated access; Perplexity nonetheless answered queries about the content. Cloudflare observed Perplexity-User/1.0 (declared, 20-25M daily requests) plus an undeclared Chrome-impersonation UA (`Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 ... Chrome/124.0.0.0 ...`, 3-6M daily requests) reaching the test domains. Cloudflare delisted Perplexity from its verified-bot list and added managed-rules heuristics to block the stealth pattern. blog.cloudflare.com/perplexity-is-using-stealth-undeclared-crawlers-to-evade-website-no-crawl-directives.
Related terms
- AI access control (/terms/ai-access-control)
- Web Bot Auth (/terms/web-bot-auth)
- Generative Engine Optimization (/terms/generative-engine-optimization)
- LLM Optimization (LLMO) (/terms/llm-optimization)
- RAG (Retrieval-Augmented Generation) (/terms/rag)
- LLMS.txt (/terms/llms-txt)
- IndexNow Protocol (/terms/indexnow-protocol)
- AI dev tool citations (/terms/ai-dev-tool-citations)
- Perplexity citation (/terms/perplexity-citation)
- Claude citation (/terms/claude-citation)
- Gemini citation (/terms/gemini-citation)
- ChatGPT search citation (/terms/chatgpt-search-citation)
- External traffic disambiguation (/terms/external-traffic-disambiguation)
FAQ
- Should I allow all AI crawler bots?
- Decide retrieval bots and training bots separately, because they answer different questions. For discoverability in AI answers, allow the retrieval / user-initiated bots (OAI-SearchBot, PerplexityBot, Claude-SearchBot, ChatGPT-User, Claude-User). Whether to allow the training bots (GPTBot, ClaudeBot, Meta-ExternalAgent, plus the Google-Extended and Applebot-Extended control tokens) is a separate licensing, IP, competitive, and compliance call independent of search visibility. OpenAI explicitly designs OAI-SearchBot and GPTBot as independent toggles for exactly this reason.
- How do I tell AI crawler traffic from real human traffic in analytics?
- Start by filtering on the user-agent string. Vercel Analytics filters some bots but not all; for accurate human-only counts, use server access logs with explicit bot-pattern exclusion or a tool like Cloudflare bot management.
- Does blocking GPTBot prevent ChatGPT from citing my site?
- It prevents future training-data inclusion but does NOT prevent ChatGPT's search mode (a different system from training) from retrieving the page in real time when web browsing is enabled. OAI-SearchBot is the retrieval crawler; GPTBot is the training crawler. OpenAI documents these as independent toggles. To remove a site from ChatGPT search answers specifically, block OAI-SearchBot; OpenAI states such sites 'will not be shown in ChatGPT search answers, though can still appear as navigational links.'
- Is robots.txt enough to control AI crawler access?
- Not entirely. Three caveats published by the operators themselves: OpenAI states 'robots.txt rules may not apply' to ChatGPT-User because it is user-initiated; Perplexity states Perplexity-User 'generally ignores robots.txt rules' for the same reason; Meta states Meta-ExternalFetcher and Meta-ExternalAgent 'may bypass robots.txt'. Separately, Cloudflare's August 2025 investigation documented Perplexity using an undeclared Chrome-impersonating user-agent to fetch never-indexed test domains that had explicit robots.txt disallow directives. Robots.txt remains the right starting point but is not enforceable for user-initiated fetchers or for actors willing to bypass declared UAs; serious blocking requires WAF / network-level controls.