AI crawler bots
AI crawler bots are user agents operated by AI search engines — GPTBot, OAI-SearchBot, PerplexityBot, ClaudeBot, Google-Extended and others — that fetch web content for training, retrieval, and citation.
Citation status
ChatGPT · Perplexity · Claude · Copilot
Last checked 2026-05-21
What is an AI crawler bot?
An AI crawler bot is a user-agent string plus an IP signature operated by an AI search engine. Each major AI engine runs at least one crawler bot, and often multiple bots for different purposes: training corpus collection, real-time retrieval, and user-initiated browsing. Site owners control bot access through directives in robots.txt.
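As a sketch of what those robots.txt directives look like (the user-agent tokens are the publicly documented ones; adjust the list to your own policy):

```txt
# Sketch: grant named AI crawlers full access.
# A group may list several User-agent lines that share one rule set.
User-agent: GPTBot
User-agent: OAI-SearchBot
User-agent: PerplexityBot
User-agent: ClaudeBot
Allow: /

# All other crawlers follow the default rules
User-agent: *
Allow: /
```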
Status in 2026 — known AI crawlers
| Operator | Training | Retrieval | User-initiated |
|---|---|---|---|
| OpenAI | GPTBot | OAI-SearchBot | ChatGPT-User |
| Perplexity | PerplexityBot | PerplexityBot | Perplexity-User |
| Anthropic | ClaudeBot | Claude-SearchBot | Claude-User |
| Google | Google-Extended (training opt-out) | Googlebot (shared with classic) | — |
| Microsoft | Bingbot (shared with classic) | Bingbot | — |
| Apple | Applebot-Extended (opt-out signal) | Applebot (general/search) | — |
| ByteDance | Bytespider | (not separately disclosed) | — |
Google-Extended and Applebot-Extended are unusual — they are training opt-out signals, not separate crawlers. Allowing them is the default; blocking removes your content from future model training but does not affect classic search indexing.
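A minimal robots.txt sketch of that opt-out, assuming the default token names:

```txt
# Opt out of future model training while keeping classic search indexing
User-agent: Google-Extended
Disallow: /

User-agent: Applebot-Extended
Disallow: /

# Googlebot and Applebot are separate tokens and remain unaffected
```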
How it relates to other concepts
- Direct mechanism behind Generative Engine Optimization — block these bots and you self-exclude from AI citation.
- Implementation detail of robots.txt configuration for AI-aware sites.
- Companion to llms.txt — robots.txt controls bot access, llms.txt curates content for the bots you allow.
- Distinguishing classic-search vs. AI-bot traffic in server logs is the foundation of measuring AI-search visibility before any vendor analytics tool is added.
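That log-level separation can be sketched with simple user-agent matching. This is a hypothetical minimal example (the pattern covers the documented AI crawler tokens from the table above; extend it to taste):

```python
import re

# Substring patterns for known AI crawler user-agent tokens
AI_BOT_PATTERN = re.compile(
    r"GPTBot|OAI-SearchBot|ChatGPT-User|PerplexityBot|Perplexity-User|"
    r"ClaudeBot|Claude-SearchBot|Claude-User|Google-Extended|"
    r"Applebot-Extended|Bytespider",
    re.IGNORECASE,
)

def classify(user_agent: str) -> str:
    """Return 'ai-bot' or 'other' for a raw User-Agent string."""
    return "ai-bot" if AI_BOT_PATTERN.search(user_agent) else "other"

# Example: tally hits from a few raw User-Agent strings
hits = [
    "Mozilla/5.0 (compatible; GPTBot/1.2; +https://openai.com/gptbot)",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) Safari/605.1.15",
    "Mozilla/5.0 (compatible; PerplexityBot/1.0)",
]
counts = {"ai-bot": 0, "other": 0}
for ua in hits:
    counts[classify(ua)] += 1
```

In practice you would feed this the User-Agent field parsed from each access-log line rather than a hard-coded list.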
FAQ
- Should I allow all AI crawler bots?
- For brands wanting AI search visibility — yes, allow all of them in robots.txt. There is almost no upside to blocking and significant downside (you self-exclude from citation opportunities). The exception is content-licensing deals (rare outside major publishers).
- How do I tell AI crawler traffic from real human traffic in analytics?
- User-agent string filtering. Vercel Analytics filters some bots but not all; for accurate human-only counts, use server access logs with explicit bot-pattern exclusion or a tool like Cloudflare bot management.
- Does blocking GPTBot prevent ChatGPT from citing my site?
- It prevents future training-data inclusion but does NOT prevent ChatGPT's search mode (a different system from training) from retrieving the page in real time when web browsing is enabled. OAI-SearchBot is the retrieval crawler; GPTBot is the training crawler.
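A robots.txt sketch of that split, using the documented OpenAI tokens:

```txt
# Sketch: opt out of training data collection but stay
# retrievable (and therefore citable) in ChatGPT search
User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Allow: /
```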